National University of Defense Technology
Abstract:Learning-based 3D shape segmentation is usually formulated as a semantic labeling problem, assuming that all parts of training shapes are annotated with a given set of tags. This assumption, however, is impractical for learning fine-grained segmentation. Although most off-the-shelf CAD models are, by construction, composed of fine-grained parts, they usually miss semantic tags and labeling those fine-grained parts is extremely tedious. We approach the problem with deep clustering, where the key idea is to learn part priors from a shape dataset with fine-grained segmentation but no part labels. Given point sampled 3D shapes, we model the clustering priors of points with a similarity matrix and achieve part segmentation through minimizing a novel low rank loss. To handle highly densely sampled point sets, we adopt a divide-and-conquer strategy. We partition the large point set into a number of blocks. Each block is segmented using a deep-clustering-based part prior network trained in a category-agnostic manner. We then train a graph convolution network to merge the segments of all blocks to form the final segmentation result. Our method is evaluated with a challenging benchmark of fine-grained segmentation, showing state-of-the-art performance.
Abstract:We introduce the concept of geometric stability to the problem of 6D object pose estimation and propose to learn pose inference based on geometrically stable patches extracted from observed 3D point clouds. According to the theory of geometric stability analysis, a minimal set of three planar/cylindrical patches are geometrically stable and determine the full 6DoFs of the object pose. We train a deep neural network to regress 6D object pose based on geometrically stable patch groups via learning both intra-patch geometric features and inter-patch contextual features. A subnetwork is jointly trained to predict per-patch poses. This auxiliary task is a relaxation of the group pose prediction: A single patch cannot determine the full 6DoFs but is able to improve pose accuracy in its corresponding DoFs. Working with patch groups makes our method generalize well for random occlusion and unseen instances. The method is easily amenable to resolve symmetry ambiguities. Our method achieves the state-of-the-art results on public benchmarks compared not only to depth-only but also to RGBD methods. It also performs well in category-level pose estimation.
Abstract:Without a shape-aware response, it is hard to characterize the 3D geometry of a point cloud efficiently with a compact set of kernels. In this paper, we advocate the use of Hausdorff distance as a shape-aware distance measure for calculating point convolutional responses. The technique we present, coined Hausdorff Point Convolution (HPC), is shape-aware. We show that HPC constitutes a powerful point feature learning with a rather compact set of only four types of geometric priors as kernels. We further develop a HPC-based deep neural network (HPC-DNN). Task-specific learning can be achieved by tuning the network weights for combining the shortest distances between input and kernel point sets. We also realize hierarchical feature learning by designing a multi-kernel HPC for multi-scale feature encoding. Extensive experiments demonstrate that HPC-DNN outperforms strong point convolution baselines (e.g., KPConv), achieving 2.8% mIoU performance boost on S3DIS and 1.5% on SemanticKITTI for semantic segmentation task.
Abstract:Autonomous 3D acquisition of outdoor environments poses special challenges. Different from indoor scenes, where the room space is delineated by clear boundaries and separations (e.g., walls and furniture), an outdoor environment is spacious and unbounded (thinking of a campus). Therefore, unlike for indoor scenes where the scanning effort is mainly devoted to the discovery of boundary surfaces, scanning an open and unbounded area requires actively delimiting the extent of scanning region and dynamically planning a traverse path within that region. Thus, for outdoor scenes, we formulate the planning of an energy-efficient autonomous scanning through a discrete-continuous optimization of robot scanning paths. The discrete optimization computes a topological map, through solving an online traveling sales problem (Online TSP), which determines the scanning goals and paths on-the-fly. The dynamic goals are determined as a collection of visit sites with high reward of visibility-to-unknown. A visit graph is constructed via connecting the visit sites with edges weighted by traversing cost. This topological map evolves as the robot scans via deleting outdated sites that are either visited or become rewardless and inserting newly discovered ones. The continuous part optimizes the traverse paths geometrically between two neighboring visit sites via maximizing the information gain of scanning along the paths. The discrete and continuous processes alternate until the traverse cost of the current graph exceeds the remaining energy capacity of the robot. Our approach is evaluated with both synthetic and field tests, demonstrating its effectiveness and advantages over alternatives. The project is at http://vcc.szu.edu.cn/research/2020/Husky, and the codes are available at https://github.com/alualu628628/Autonomous-Outdoor-Scanning-via-Online-Topological-and-Geometric-Path-Optimization.
Abstract:We present a novel attention-based mechanism for learning enhanced point features for tasks such as point cloud classification and segmentation. Our key message is that if the right attention point is selected, then "one point is all you need" -- not a sequence as in a recurrent model and not a pre-selected set as in all prior works. Also, where the attention point is should be learned, from data and specific to the task at hand. Our mechanism is characterized by a new and simple convolution, which combines the feature at an input point with the feature at its associated attention point. We call such a point a directional attention point (DAP), since it is found by adding to the original point an offset vector that is learned by maximizing the task performance in training. We show that our attention mechanism can be easily incorporated into state-of-the-art point cloud classification and segmentation networks. Extensive experiments on common benchmarks such as ModelNet40, ShapeNetPart, and S3DIS demonstrate that our DAP-enabled networks consistently outperform the respective original networks, as well as all other competitive alternatives, including those employing pre-selected sets of attention points.
Abstract:Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors. This approach introduces a trade-off between disentangled representation learning and reconstruction quality since the model does not have enough capacity to learn correlated latent variables that capture detail information present in most image data. To overcome this trade-off, we present a novel multi-stage modelling approach where the disentangled factors are first learned using a preexisting disentangled representation learning method (such as $\beta$-TCVAE); then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables, adding detail information while maintaining conditioning on the previously learned disentangled factors. Taken together, our multi-stage modelling approach results in a single, coherent probabilistic model that is theoretically justified by the principal of D-separation and can be realized with a variety of model classes including likelihood-based models such as variational autoencoders, implicit models such as generative adversarial networks, and tractable models like normalizing flows or mixtures of Gaussians. We demonstrate that our multi-stage model has much higher reconstruction quality than current state-of-the-art methods with equivalent disentanglement performance across multiple standard benchmarks.
Abstract:We present a target-driven navigation system to improve mapless visual navigation in indoor scenes. Our method takes a multi-view observation of a robot and a target as inputs at each time step to provide a sequence of actions that move the robot to the target without relying on odometry or GPS at runtime. The system is learned by optimizing a combinational objective encompassing three key designs. First, we propose that an agent conceives the next observation before making an action decision. This is achieved by learning a variational generative module from expert demonstrations. We then propose predicting static collision in advance, as an auxiliary task to improve safety during navigation. Moreover, to alleviate the training data imbalance problem of termination action prediction, we also introduce a target checking module to differentiate from augmenting navigation policy with a termination action. The three proposed designs all contribute to the improved training data efficiency, static collision avoidance, and navigation generalization performance, resulting in a novel target-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.
Abstract:We study the problem of symmetry detection of 3D shapes from single-view RGB-D images, where severely missing data renders geometric detection approach infeasible. We propose an end-to-end deep neural network which is able to predict both reflectional and rotational symmetries of 3D objects present in the input RGB-D image. Directly training a deep model for symmetry prediction, however, can quickly run into the issue of overfitting. We adopt a multi-task learning approach. Aside from symmetry axis prediction, our network is also trained to predict symmetry correspondences. In particular, given the 3D points present in the RGB-D image, our network outputs for each 3D point its symmetric counterpart corresponding to a specific predicted symmetry. In addition, our network is able to detect for a given shape multiple symmetries of different types. We also contribute a benchmark of 3D symmetry detection based on single-view RGB-D images. Extensive evaluation on the benchmark demonstrates the strong generalization ability of our method, in terms of high accuracy of both symmetry axis prediction and counterpart estimation. In particular, our method is robust in handling unseen object instances with large variation in shape, multi-symmetry composition, as well as novel object categories.
Abstract:We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data. We represent these edges as a collection of parametric curves (i.e.,lines, circles, and B-splines). Accordingly, our deep neural network, coined PIE-NET, is trained for parametric inference of edges. The network relies on a "region proposal" architecture, where a first module proposes an over-complete collection of edge and corner points, and a second module ranks each proposal to decide whether it should be considered. We train and evaluate our method on the ABC dataset, a large dataset of CAD models, and compare our results to those produced by traditional (non-learning) processing pipelines, as well as a recent deep learning based edge detector (EC-NET). Our results significantly improve over the state-of-the-art from both a quantitative and qualitative standpoint.
Abstract:We solve a challenging yet practically useful variant of 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement also subjects to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervisions and transformations to DRL facilitate the agent to learn feasible policies efficiently. Our method can also be generalized e.g., with the ability to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains a human-level performance.