Alert button
Picture for Mingyang Li

Mingyang Li

Alert button

NeRF-VINS: A Real-time Neural Radiance Field Map-based Visual-Inertial Navigation System

Sep 17, 2023
Saimouli Katragadda, Woosik Lee, Yuxiang Peng, Patrick Geneva, Chuchu Chen, Chao Guo, Mingyang Li, Guoquan Huang

Achieving accurate, efficient, and consistent localization within an a priori environment map remains a fundamental challenge in robotics and computer vision. Conventional map-based keyframe localization often suffers from sub-optimal viewpoints due to limited field of view (FOV), thus degrading its performance. To address this issue, in this paper, we design a real-time tightly-coupled Neural Radiance Fields (NeRF)-aided visual-inertial navigation system (VINS), termed NeRF-VINS. By effectively leveraging NeRF's potential to synthesize novel views, essential for addressing limited viewpoints, the proposed NeRF-VINS optimally fuses IMU and monocular image measurements along with synthetically rendered images within an efficient filter-based framework. This tightly coupled integration enables 3D motion tracking with bounded error. We extensively compare the proposed NeRF-VINS against the state-of-the-art methods that use prior map information, which is shown to achieve superior performance. We also demonstrate the proposed method is able to perform real-time estimation at 15 Hz, on a resource-constrained Jetson AGX Orin embedded platform with impressive accuracy.

* 6 pages, 7 figures 
Viaarxiv icon

Closed-Loop Transcription via Convolutional Sparse Coding

Feb 18, 2023
Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, XIaojun Yuan, Heung-Yeung Shum, Lionel M. Ni, Yi Ma

Figure 1 for Closed-Loop Transcription via Convolutional Sparse Coding
Figure 2 for Closed-Loop Transcription via Convolutional Sparse Coding
Figure 3 for Closed-Loop Transcription via Convolutional Sparse Coding
Figure 4 for Closed-Loop Transcription via Convolutional Sparse Coding

Autoencoding has achieved great empirical success as a framework for learning generative models for natural images. Autoencoders often use generic deep networks as the encoder or decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we make the explicit assumption that the image distribution is generated from a multi-stage sparse deconvolution. The corresponding inverse map, which we use as an encoder, is a multi-stage convolution sparse coding (CSC), with each stage obtained from unrolling an optimization algorithm for solving the corresponding (convexified) sparse coding program. To avoid computational difficulties in minimizing distributional distance between the real and generated images, we utilize the recent closed-loop transcription (CTRL) framework that optimizes the rate reduction of the learned sparse representations. Conceptually, our method has high-level connections to score-matching methods such as diffusion models. Empirically, our framework demonstrates competitive performance on large-scale datasets, such as ImageNet-1K, compared to existing autoencoding and generative methods under fair conditions. Even with simpler networks and fewer computational resources, our method demonstrates high visual quality in regenerated images. More surprisingly, the learned autoencoder performs well on unseen datasets. Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, and scalability to large datasets. Our method is arguably the first to demonstrate that a concatenation of multiple convolution sparse coding/decoding layers leads to an interpretable and effective autoencoder for modeling the distribution of large-scale natural image datasets.

* 20 pages 
Viaarxiv icon

Unsupervised Learning of Structured Representations via Closed-Loop Transcription

Oct 30, 2022
Shengbang Tong, Xili Dai, Yubei Chen, Mingyang Li, Zengyi Li, Brent Yi, Yann LeCun, Yi Ma

Figure 1 for Unsupervised Learning of Structured Representations via Closed-Loop Transcription
Figure 2 for Unsupervised Learning of Structured Representations via Closed-Loop Transcription
Figure 3 for Unsupervised Learning of Structured Representations via Closed-Loop Transcription
Figure 4 for Unsupervised Learning of Structured Representations via Closed-Loop Transcription

This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes. While most existing unsupervised learning approaches focus on a representation for only one of these two goals, we show that a unified representation can enjoy the mutual benefits of having both. Such a representation is attainable by generalizing the recently proposed \textit{closed-loop transcription} framework, known as CTRL, to the unsupervised setting. This entails solving a constrained maximin game over a rate reduction objective that expands features of all samples while compressing features of augmentations of each sample. Through this process, we see discriminative low-dimensional structures emerge in the resulting representations. Under comparable experimental conditions and network complexities, we demonstrate that these structured representations enable classification performance close to state-of-the-art unsupervised discriminative representations, and conditionally generated image quality significantly higher than that of state-of-the-art unsupervised generative models. Source code can be found at https://github.com/Delay-Xili/uCTRL.

* 17 pages 
Viaarxiv icon

Revisiting Sparse Convolutional Model for Visual Recognition

Oct 24, 2022
Xili Dai, Mingyang Li, Pengyuan Zhai, Shengbang Tong, Xingjian Gao, Shao-Lun Huang, Zhihui Zhu, Chong You, Yi Ma

Figure 1 for Revisiting Sparse Convolutional Model for Visual Recognition
Figure 2 for Revisiting Sparse Convolutional Model for Visual Recognition
Figure 3 for Revisiting Sparse Convolutional Model for Visual Recognition
Figure 4 for Revisiting Sparse Convolutional Model for Visual Recognition

Despite strong empirical performance for image classification, deep neural networks are often regarded as ``black boxes'' and they are difficult to interpret. On the other hand, sparse convolutional models, which assume that a signal can be expressed by a linear combination of a few elements from a convolutional dictionary, are powerful tools for analyzing natural images with good theoretical interpretability and biological plausibility. However, such principled models have not demonstrated competitive performance when compared with empirically designed deep networks. This paper revisits the sparse convolutional modeling for image classification and bridges the gap between good empirical performance (of deep learning) and good interpretability (of sparse convolutional models). Our method uses differentiable optimization layers that are defined from convolutional sparse coding as drop-in replacements of standard convolutional layers in conventional deep neural networks. We show that such models have equally strong empirical performance on CIFAR-10, CIFAR-100, and ImageNet datasets when compared to conventional neural networks. By leveraging stable recovery property of sparse modeling, we further show that such models can be much more robust to input corruptions as well as adversarial perturbations in testing through a simple proper trade-off between sparse regularization and data reconstruction terms. Source code can be found at https://github.com/Delay-Xili/SDNet.

* 17 pages. Accepted by NeurIPS2022 
Viaarxiv icon

SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud

Aug 03, 2022
Xiangrui Zhao, Sheng Yang, Tianxin Huang, Jun Chen, Teng Ma, Mingyang Li, Yong Liu

Figure 1 for SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud
Figure 2 for SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud
Figure 3 for SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud
Figure 4 for SuperLine3D: Self-supervised Line Segmentation and Description for LiDAR Point Cloud

Poles and building edges are frequently observable objects on urban roads, conveying reliable hints for various computer vision tasks. To repetitively extract them as features and perform association between discrete LiDAR frames for registration, we propose the first learning-based feature segmentation and description model for 3D lines in LiDAR point cloud. To train our model without the time consuming and tedious data labeling process, we first generate synthetic primitives for the basic appearance of target lines, and build an iterative line auto-labeling process to gradually refine line labels on real LiDAR scans. Our segmentation model can extract lines under arbitrary scale perturbations, and we use shared EdgeConv encoder layers to train the two segmentation and descriptor heads jointly. Base on the model, we can build a highly-available global registration module for point cloud registration, in conditions without initial transformation hints. Experiments have demonstrated that our line-based registration method is highly competitive to state-of-the-art point-based approaches. Our code is available at https://github.com/zxrzju/SuperLine3D.git.

* 17 pages, ECCV 2022 Accepted 
Viaarxiv icon

DuMLP-Pin: A Dual-MLP-dot-product Permutation-invariant Network for Set Feature Extraction

Mar 08, 2022
Jiajun Fei, Ziyu Zhu, Wenlei Liu, Zhidong Deng, Mingyang Li, Huanjun Deng, Shuo Zhang

Figure 1 for DuMLP-Pin: A Dual-MLP-dot-product Permutation-invariant Network for Set Feature Extraction
Figure 2 for DuMLP-Pin: A Dual-MLP-dot-product Permutation-invariant Network for Set Feature Extraction
Figure 3 for DuMLP-Pin: A Dual-MLP-dot-product Permutation-invariant Network for Set Feature Extraction
Figure 4 for DuMLP-Pin: A Dual-MLP-dot-product Permutation-invariant Network for Set Feature Extraction

Existing permutation-invariant methods can be divided into two categories according to the aggregation scope, i.e. global aggregation and local one. Although the global aggregation methods, e. g., PointNet and Deep Sets, get involved in simpler structures, their performance is poorer than the local aggregation ones like PointNet++ and Point Transformer. It remains an open problem whether there exists a global aggregation method with a simple structure, competitive performance, and even much fewer parameters. In this paper, we propose a novel global aggregation permutation-invariant network based on dual MLP dot-product, called DuMLP-Pin, which is capable of being employed to extract features for set inputs, including unordered or unstructured pixel, attribute, and point cloud data sets. We strictly prove that any permutation-invariant function implemented by DuMLP-Pin can be decomposed into two or more permutation-equivariant ones in a dot-product way as the cardinality of the given input set is greater than a threshold. We also show that the DuMLP-Pin can be viewed as Deep Sets with strong constraints under certain conditions. The performance of DuMLP-Pin is evaluated on several different tasks with diverse data sets. The experimental results demonstrate that our DuMLP-Pin achieves the best results on the two classification problems for pixel sets and attribute sets. On both the point cloud classification and the part segmentation, the accuracy of DuMLP-Pin is very close to the so-far best-performing local aggregation method with only a 1-2% difference, while the number of required parameters is significantly reduced by more than 85% in classification and 69% in segmentation, respectively. The code is publicly available on https://github.com/JaronTHU/DuMLP-Pin.

* 16 pages, accepted by AAAI 2022, with technical appendix 
Viaarxiv icon

Translation Invariant Global Estimation of Heading Angle Using Sinogram of LiDAR Point Cloud

Mar 02, 2022
Xiaqing Ding, Xuecheng Xu, Sha Lu, Yanmei Jiao, Mengwen Tan, Rong Xiong, Huanjun Deng, Mingyang Li, Yue Wang

Figure 1 for Translation Invariant Global Estimation of Heading Angle Using Sinogram of LiDAR Point Cloud
Figure 2 for Translation Invariant Global Estimation of Heading Angle Using Sinogram of LiDAR Point Cloud
Figure 3 for Translation Invariant Global Estimation of Heading Angle Using Sinogram of LiDAR Point Cloud
Figure 4 for Translation Invariant Global Estimation of Heading Angle Using Sinogram of LiDAR Point Cloud

Global point cloud registration is an essential module for localization, of which the main difficulty exists in estimating the rotation globally without initial value. With the aid of gravity alignment, the degree of freedom in point cloud registration could be reduced to 4DoF, in which only the heading angle is required for rotation estimation. In this paper, we propose a fast and accurate global heading angle estimation method for gravity-aligned point clouds. Our key idea is that we generate a translation invariant representation based on Radon Transform, allowing us to solve the decoupled heading angle globally with circular cross-correlation. Besides, for heading angle estimation between point clouds with different distributions, we implement this heading angle estimator as a differentiable module to train a feature extraction network end- to-end. The experimental results validate the effectiveness of the proposed method in heading angle estimation and show better performance compared with other methods.

* Paper accepted in the 2022 IEEE International Conference on Robotics and Automation (ICRA) 
Viaarxiv icon

Incremental Learning of Structured Memory via Closed-Loop Transcription

Feb 14, 2022
Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi, Yi Ma

Figure 1 for Incremental Learning of Structured Memory via Closed-Loop Transcription
Figure 2 for Incremental Learning of Structured Memory via Closed-Loop Transcription
Figure 3 for Incremental Learning of Structured Memory via Closed-Loop Transcription
Figure 4 for Incremental Learning of Structured Memory via Closed-Loop Transcription

This work proposes a minimal computational model for learning a structured memory of multiple object classes in an incremental setting. Our approach is based on establishing a closed-loop transcription between multiple classes and their corresponding subspaces, known as a linear discriminative representation, in a low-dimensional feature space. Our method is both simpler and more efficient than existing approaches to incremental learning, in terms of model size, storage, and computation: it requires only a single, fixed-capacity autoencoding network with a feature space that is used for both discriminative and generative purposes. All network parameters are optimized simultaneously without architectural manipulations, by solving a constrained minimax game between the encoding and decoding maps over a single rate reduction-based objective. Experimental results show that our method can effectively alleviate catastrophic forgetting, achieving significantly better performance than prior work for both generative and discriminative purposes.

* 13 pages 
Viaarxiv icon

Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction

Nov 12, 2021
Xili Dai, Shengbang Tong, Mingyang Li, Ziyang Wu, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Michael Psenka, Xiaojun Yuan, Heung Yeung Shum, Yi Ma

Figure 1 for Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Figure 2 for Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Figure 3 for Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Figure 4 for Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction

This work proposes a new computational framework for learning an explicit generative model for real-world datasets. In particular we propose to learn {\em a closed-loop transcription} between a multi-class multi-dimensional data distribution and a { linear discriminative representation (LDR)} in the feature space that consists of multiple independent multi-dimensional linear subspaces. In particular, we argue that the optimal encoding and decoding mappings sought can be formulated as the equilibrium point of a {\em two-player minimax game between the encoder and decoder}. A natural utility function for this game is the so-called {\em rate reduction}, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback from control systems and avoids expensive evaluating and minimizing approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the settings of learning a {\em both discriminative and generative} representation for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate tremendous potential of this new closed-loop formulation: under fair comparison, visual quality of the learned decoder and classification performance of the encoder is competitive and often better than existing methods based on GAN, VAE, or a combination of both. We notice that the so learned features of different classes are explicitly mapped onto approximately {\em independent principal subspaces} in the feature space; and diverse visual attributes within each class are modeled by the {\em independent principal components} within each subspace.

* 37 pages 
Viaarxiv icon

CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization

Oct 21, 2021
Wenzheng Hu, Ning Liu, Zhengping Che, Mingyang Li, Jian Tang, Changshui Zhang, Jianqiang Wang

Figure 1 for CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization
Figure 2 for CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization
Figure 3 for CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization
Figure 4 for CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization

Deep convolutional neural networks are shown to be overkill with high parametric and computational redundancy in many application scenarios, and an increasing number of works have explored model pruning to obtain lightweight and efficient networks. However, most existing pruning approaches are driven by empirical heuristics and rarely consider the joint impact of channels, leading to unguaranteed and suboptimal performance. In this paper, we propose a novel channel pruning method via class-aware trace ratio optimization (CATRO) to reduce the computational burden and accelerate the model inference. Utilizing class information from a few samples, CATRO measures the joint impact of multiple channels by feature space discriminations and consolidates the layer-wise impact of preserved channels. By formulating channel pruning as a submodular set function maximization problem, CATRO solves it efficiently via a two-stage greedy iterative optimization procedure. More importantly, we present theoretical justifications on convergence and performance of CATRO. Experimental results demonstrate that CATRO achieves higher accuracy with similar computation cost or lower computation cost with similar accuracy than other state-of-the-art channel pruning algorithms. In addition, because of its class-aware property, CATRO is suitable to prune efficient networks adaptively for various classification subtasks, enhancing handy deployment and usage of deep networks in real-world applications.

Viaarxiv icon