Abstract:Optical flow estimation aims to find the 2D motion field by identifying corresponding pixels between two images. Despite the tremendous progress of deep learning-based optical flow methods, it remains a challenge to accurately estimate large displacements with motion blur. This is mainly because the correlation volume, the basis of pixel matching, is computed as the dot product of the convolutional features of the two images. The locality of convolutional features makes the computed correlations susceptible to various noises. On large displacements with motion blur, noisy correlations could cause severe errors in the estimated flow. To overcome this challenge, we propose a new architecture "CRoss-Attentional Flow Transformer" (CRAFT), aiming to revitalize the correlation volume computation. In CRAFT, a Semantic Smoothing Transformer layer transforms the features of one frame, making them more global and semantically stable. In addition, the dot-product correlations are replaced with transformer Cross-Frame Attention. This layer filters out feature noises through the Query and Key projections, and computes more accurate correlations. On Sintel (Final) and KITTI (foreground) benchmarks, CRAFT has achieved new state-of-the-art performance. Moreover, to test the robustness of different models on large motions, we designed an image shifting attack that shifts input images to generate large artificial motions. Under this attack, CRAFT performs much more robustly than two representative methods, RAFT and GMA. The code of CRAFT is is available at https://github.com/askerlee/craft.
Abstract:In this paper we present SA-CNN, a hierarchical and lightweight self-attention based encoding and decoding architecture for representation learning of point cloud data. The proposed SA-CNN introduces convolution and transposed convolution stacks to capture and generate contextual information among unordered 3D points. Following conventional hierarchical pipeline, the encoding process extracts feature in local-to-global manner, while the decoding process generates feature and point cloud in coarse-to-fine, multi-resolution stages. We demonstrate that SA-CNN is capable of a wide range of applications, namely classification, part segmentation, reconstruction, shape retrieval, and unsupervised classification. While achieving the state-of-the-art or comparable performance in the benchmarks, SA-CNN maintains its model complexity several order of magnitude lower than the others. In term of qualitative results, we visualize the multi-stage point cloud reconstructions and latent walks on rigid objects as well as deformable non-rigid human and robot models.
Abstract:Point cloud instance segmentation has achieved huge progress with the emergence of deep learning. However, these methods are usually data-hungry with expensive and time-consuming dense point cloud annotations. To alleviate the annotation cost, unlabeled or weakly labeled data is still less explored in the task. In this paper, we introduce the first semi-supervised point cloud instance segmentation framework (SPIB) using both labeled and unlabelled bounding boxes as supervision. To be specific, our SPIB architecture involves a two-stage learning procedure. For stage one, a bounding box proposal generation network is trained under a semi-supervised setting with perturbation consistency regularization (SPCR). The regularization works by enforcing an invariance of the bounding box predictions over different perturbations applied to the input point clouds, to provide self-supervision for network learning. For stage two, the bounding box proposals with SPCR are grouped into some subsets, and the instance masks are mined inside each subset with a novel semantic propagation module and a property consistency graph module. Moreover, we introduce a novel occupancy ratio guided refinement module to refine the instance masks. Extensive experiments on the challenging ScanNet v2 dataset demonstrate our method can achieve competitive performance compared with the recent fully-supervised methods.
Abstract:There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", whereby AI algorithms and agents no longer simply learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through embodied physical interactions with their environments, whether real or simulated. Consequently, there has been substantial growth in the demand for embodied AI simulators to support a diversity of embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of artificial general intelligence, but there is no contemporary and comprehensive survey of this field. This paper comprehensively surveys state-of-the-art embodied AI simulators and research, mapping connections between these. By benchmarking nine state-of-the-art embodied AI simulators in terms of seven features, this paper aims to understand the simulators in their provision for use in embodied AI research. Finally, based upon the simulators and a pyramidal hierarchy of embodied AI research tasks, this paper surveys the main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation and datasets.
Abstract:Learning-based methods have been used to pro-gram robotic tasks in recent years. However, extensive training is usually required not only for the initial task learning but also for generalizing the learned model to the same task but in different environments. In this paper, we propose a novel Deep Reinforcement Learning algorithm for efficient task generalization and environment adaptation in the robotic task learning problem. The proposed method is able to efficiently generalize the previously learned task by model fusion to solve the environment adaptation problem. The proposed Deep Model Fusion (DMF) method reuses and combines the previously trained model to improve the learning efficiency and results.Besides, we also introduce a Multi-objective Guided Reward(MGR) shaping technique to further improve training efficiency.The proposed method was benchmarked with previous methods in various environments to validate its effectiveness.
Abstract:6D object pose estimation is widely applied in robotic tasks such as grasping and manipulation. Prior methods using RGB-only images are vulnerable to heavy occlusion and poor illumination, so it is important to complement them with depth information. However, existing methods using RGB-D data don't adequately exploit consistent and complementary information between two modalities. In this paper, we present a novel method to effectively consider the correlation within and across RGB and depth modalities with attention mechanism to learn discriminative multi-modal features. Then, effective fusion strategies for intra- and inter-correlation modules are explored to ensure efficient information flow between RGB and depth. To the best of our knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation and experimental results show that our method can help achieve the state-of-the-art performance on LineMOD and YCB-Video datasets as well as benefit robot grasping task.
Abstract:Graph-based clustering has shown promising performance in many tasks. A key step of graph-based approach is the similarity graph construction. In general, learning graph in kernel space can enhance clustering accuracy due to the incorporation of nonlinearity. However, most existing kernel-based graph learning mechanisms is not similarity-preserving, hence leads to sub-optimal performance. To overcome this drawback, we propose a more discriminative graph learning method which can preserve the pairwise similarities between samples in an adaptive manner for the first time. Specifically, we require the learned graph be close to a kernel matrix, which serves as a measure of similarity in raw data. Moreover, the structure is adaptively tuned so that the number of connected components of the graph is exactly equal to the number of clusters. Finally, our method unifies clustering and graph learning which can directly obtain cluster indicators from the graph itself without performing further clustering step. The effectiveness of this approach is examined on both single and multiple kernel learning scenarios in several datasets.
Abstract:A large amount of annotated training images is critical for training accurate and robust deep network models but the collection of a large amount of annotated training images is often time-consuming and costly. Image synthesis alleviates this constraint by generating annotated training images automatically by machines which has attracted increasing interest in the recent deep learning research. We develop an innovative image synthesis technique that composes annotated training images by realistically embedding foreground objects of interest (OOI) into background images. The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training. The first is context-aware semantic coherence which ensures that the OOI are placed around semantically coherent regions within the background image. The second is harmonious appearance adaptation which ensures that the embedded OOI are agreeable to the surrounding background from both geometry alignment and appearance realism. The proposed technique has been evaluated over two related but very different computer vision challenges, namely, scene text detection and scene text recognition. Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique - the use of our synthesized images in deep network training is capable of achieving similar or even better scene text detection and scene text recognition performance as compared with using real images.
Abstract:Recent advances in generative adversarial networks (GANs) have shown great potentials in realistic image synthesis whereas most existing works address synthesis realism in either appearance space or geometry space but few in both. This paper presents an innovative Spatial Fusion GAN (SF-GAN) that combines a geometry synthesizer and an appearance synthesizer to achieve synthesis realism in both geometry and appearance spaces. The geometry synthesizer learns contextual geometries of background images and transforms and places foreground objects into the background images unanimously. The appearance synthesizer adjusts the color, brightness and styles of the foreground objects and embeds them into background images harmoniously, where a guided filter is introduced for detail preserving. The two synthesizers are inter-connected as mutual references which can be trained end-to-end without supervision. The SF-GAN has been evaluated in two tasks: (1) realistic scene text image synthesis for training better recognition models; (2) glass and hat wearing for realistic matching glasses and hats with real portraits. Qualitative and quantitative comparisons with the state-of-the-art demonstrate the superiority of the proposed SF-GAN.
Abstract:Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and contributions of different attention strategies.