Hang Zhao

A Universal Semantic-Geometric Representation for Robotic Manipulation

Jun 18, 2023
Tong Zhang, Yingdong Hu, Hanchen Cui, Hang Zhao, Yang Gao

Robots rely heavily on sensors, especially RGB and depth cameras, to perceive and interact with the world. RGB cameras record 2D images with rich semantic information while missing precise spatial information. On the other side, depth cameras offer critical 3D geometry data but capture limited semantics. Therefore, integrating both modalities is crucial for learning representations for robotic perception and control. However, current research predominantly focuses on only one of these modalities, neglecting the benefits of incorporating both. To this end, we present Semantic-Geometric Representation (SGR), a universal perception module for robotics that leverages the rich semantic information of large-scale pre-trained 2D models and inherits the merits of 3D spatial reasoning. Our experiments demonstrate that SGR empowers the agent to successfully complete a diverse range of simulated and real-world robotic manipulation tasks, outperforming state-of-the-art methods significantly in both single-task and multi-task settings. Furthermore, SGR possesses the unique capability to generalize to novel semantic attributes, setting it apart from the other methods.
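
The abstract does not say how SGR fuses the two modalities. One common approach, shown here purely as an illustrative assumption, is to back-project each depth pixel into 3D with a pinhole camera model and attach the 2D feature vector computed at that pixel. The function name, the intrinsics `fx, fy, cx, cy`, and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_with_features(
    depth: Sequence[Sequence[float]],
    features: Sequence[Sequence[Sequence[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[Point, List[float]]]:
    """Lift each valid depth pixel to 3D and pair it with that pixel's 2D feature.

    depth[v][u] is the depth at image row v, column u; features[v][u] is the
    feature vector a 2D model produced for the same pixel (assumed given).
    """
    fused = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # no depth reading at this pixel, skip it
                continue
            # Pinhole camera model: pixel (u, v) at depth z -> camera-frame (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            fused.append(((x, y, z), list(features[v][u])))
    return fused
```

Each returned pair carries geometry (the 3D point) and semantics (the feature vector), which is the kind of input a point-based network could consume.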

SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving

Jun 15, 2023
Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, Yue Wang, Hang Zhao, Zhiding Yu, Chen Feng

Semantic scene completion (SSC) is crucial for holistic 3D scene understanding by jointly estimating semantics and geometry from sparse observations. However, progress in SSC, particularly in autonomous driving scenarios, is hindered by the scarcity of high-quality datasets. To overcome this challenge, we introduce SSCBench, a comprehensive benchmark that integrates scenes from widely-used automotive datasets (e.g., KITTI-360, nuScenes, and Waymo). SSCBench follows an established setup and format in the community, facilitating the easy exploration of the camera- and LiDAR-based SSC across various real-world scenarios. We present quantitative and qualitative evaluations of state-of-the-art algorithms on SSCBench and commit to continuously incorporating novel automotive datasets and SSC algorithms to drive further advancements in this field. Our resources are released on https://github.com/ai4ce/SSCBench.

* Submitted to NeurIPS 2023 D&B track

ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory

Jun 07, 2023
Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, Hang Zhao

Large language models (LLMs) with memory are computationally universal. However, mainstream LLMs are not taking full advantage of memory, and the designs are heavily influenced by biological brains. Due to their approximate nature and proneness to the accumulation of errors, conventional neural memory mechanisms cannot support LLMs in simulating complex reasoning. In this paper, we seek inspiration from modern computer architectures to augment LLMs with symbolic memory for complex multi-hop reasoning. Such a symbolic memory framework is instantiated as an LLM and a set of SQL databases, where the LLM generates SQL instructions to manipulate the SQL databases. We validate the effectiveness of the proposed memory framework on a synthetic dataset requiring complex reasoning. The project website is available at https://chatdatabase.github.io/.
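
The loop described above, in which a model emits SQL and a database holds the state, can be sketched as follows. This is a minimal illustration, not the ChatDB implementation: `fake_llm_to_sql` is a hypothetical stand-in for a real LLM call, and the `sales` table and its contents are invented for the example.

```python
import sqlite3
from typing import List


def fake_llm_to_sql(instruction: str) -> List[str]:
    """Hypothetical stand-in for an LLM: maps an instruction to SQL statements."""
    canned = {
        "record sale": ["INSERT INTO sales (item, qty) VALUES ('apple', 3)"],
        "record return": ["UPDATE sales SET qty = qty - 1 WHERE item = 'apple'"],
    }
    return canned[instruction]


def run_with_symbolic_memory(instructions: List[str]) -> int:
    """Execute generated SQL step by step; the database acts as exact memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT PRIMARY KEY, qty INTEGER)")
    for instruction in instructions:
        for statement in fake_llm_to_sql(instruction):
            conn.execute(statement)  # state changes are stored symbolically
    (qty,) = conn.execute("SELECT qty FROM sales WHERE item = 'apple'").fetchone()
    conn.close()
    return qty
```

Because intermediate results live in the database and are never re-summarized by the model, a multi-step chain such as a sale followed by a return does not accumulate approximation error.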

GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training

May 15, 2023
Xiaoyu Tian, Haoxi Ran, Yue Wang, Hang Zhao

This paper tries to address a fundamental question in point cloud self-supervised learning: what is a good signal we should leverage to learn features from point clouds without annotations? To answer that, we introduce a point cloud representation learning framework, based on geometric feature reconstruction. In contrast to recent papers that directly adopt masked autoencoder (MAE) and only predict original coordinates or occupancy from masked point clouds, our method revisits differences between images and point clouds and identifies three self-supervised learning objectives peculiar to point clouds, namely centroid prediction, normal estimation, and curvature prediction. Combined with occupancy prediction, these four objectives yield a nontrivial self-supervised learning task and mutually facilitate models to better reason fine-grained geometry of point clouds. Our pipeline is conceptually simple and it consists of two major steps: first, it randomly masks out groups of points, followed by a Transformer-based point cloud encoder; second, a lightweight Transformer decoder predicts centroid, normal, and curvature for points in each voxel. We transfer the pre-trained Transformer encoder to a downstream perception model. On the nuScenes dataset, our model achieves 3.38 mAP improvement for object detection, 2.1 mIoU gain for segmentation, and 1.7 AMOTA gain for multi-object tracking. We also conduct experiments on the Waymo Open Dataset and achieve significant performance improvements over baselines as well.

* Accepted to CVPR 2023
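
To make the three geometric targets concrete, the sketch below computes a centroid, a normal, and a curvature value for the points in one voxel. The abstract does not define how the targets are computed; the PCA-based normal and the "surface variation" curvature (smallest eigenvalue over the eigenvalue sum) used here are common choices and are assumptions, as is the function name.

```python
import numpy as np


def geometric_targets(points: np.ndarray):
    """Return (centroid, normal, curvature) for an (N, 3) array of points, N >= 3."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)  # 3x3 covariance of the point group
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]  # direction of least variance; sign is arbitrary
    total = eigvals.sum()
    # Surface variation: close to 0 for a flat patch, larger for curved ones.
    curvature = float(eigvals[0] / total) if total > 0 else 0.0
    return centroid, normal, curvature
```

For a perfectly planar patch the normal is perpendicular to the plane and the curvature is zero up to floating-point error.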

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

May 03, 2023
Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Tianyuan Yuan, Yue Wang, Yang Yuan, Hang Zhao

We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model's generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
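
As a rough illustration of the ensemble idea, the sketch below averages class probabilities from independently trained uni-modal models. The abstract does not specify how UME combines predictions, so the averaging rule and the function names are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uni_modal_ensemble(per_modality_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities across separately trained uni-modal models."""
    probs = [softmax(logits) for logits in per_modality_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
```

For example, `uni_modal_ensemble([[2.0, 0.0], [0.0, 1.0]])` combines an audio model that prefers class 0 with a video model that prefers class 1; neither model sees the other's input, so no paired features are used.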

Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

Apr 27, 2023
Xiaoyu Tian, Tao Jiang, Longfei Yun, Yue Wang, Yilun Wang, Hang Zhao

Robotic perception requires the modeling of both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, neglecting finer geometric details and struggling to handle general, out-of-vocabulary objects. To overcome these limitations, we introduce a novel task for 3D occupancy prediction, which aims to estimate the detailed occupancy and semantics of objects from multi-view images. To facilitate this task, we develop a label generation pipeline that produces dense, visibility-aware labels for a given scene. This pipeline includes point cloud aggregation, point labeling, and occlusion handling. We construct two benchmarks based on the Waymo Open Dataset and the nuScenes Dataset, resulting in the Occ3D-Waymo and Occ3D-nuScenes benchmarks. Lastly, we propose a model, dubbed Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance in the 3D occupancy prediction task. This approach addresses the need for finer geometric understanding in a coarse-to-fine fashion. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
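
The first two stages of the label pipeline, point cloud aggregation and point labeling, can be sketched as voxelizing labeled points from several frames and taking a majority vote per voxel. This is a simplified illustration under stated assumptions: points are already in a shared world frame, the occlusion-handling stage is omitted, and the function name and data layout are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

LabeledPoint = Tuple[Tuple[float, float, float], str]
VoxelIndex = Tuple[int, int, int]


def voxelize_labels(
    frames: Iterable[List[LabeledPoint]], voxel_size: float
) -> Dict[VoxelIndex, str]:
    """Aggregate labeled points over frames; label each occupied voxel by majority vote."""
    votes: Dict[VoxelIndex, Counter] = defaultdict(Counter)
    for frame in frames:  # aggregation: pool points from all frames
        for (x, y, z), label in frame:
            idx = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            )
            votes[idx][label] += 1
    # A voxel is occupied if any point fell into it; its label is the most common one.
    return {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}
```

Voxels absent from the returned dictionary are simply unobserved here; deciding whether they are free or occluded is what a visibility-aware stage would add.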

Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Apr 26, 2023
Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, Yang Gao

Robots operating in the real world require both rich manipulation skills and the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose ProgramPort, a modular approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters to general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage at: https://progport.github.io.

* ICLR 2023 camera-ready

What Happened 3 Seconds Ago? Inferring the Past with Thermal Imaging

Apr 26, 2023
Zitian Tang, Wenjie Ye, Wei-Chiu Ma, Hang Zhao

Inferring past human motion from RGB images is challenging due to the inherent uncertainty of the prediction problem. Thermal images, on the other hand, encode traces of past human-object interactions left in the environment via thermal radiation measurement. Based on this observation, we collect the first RGB-Thermal dataset for human motion analysis, dubbed Thermal-IM. Then we develop a three-stage neural network model for accurate past human pose estimation. Comprehensive experiments show that thermal cues significantly reduce the ambiguities of this task, and the proposed model achieves remarkable performance. The dataset is available at https://github.com/ZitianTang/Thermal-IM.

Neural Map Prior for Autonomous Driving

Apr 17, 2023
Xuan Xiong, Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, Hang Zhao

High-definition (HD) semantic maps are crucial for autonomous vehicles navigating urban environments. Traditional offline HD maps, created through labor-intensive manual annotation processes, are both costly and incapable of accommodating timely updates. Recently, researchers have proposed inferring local maps based on online sensor observations; however, this approach is constrained by the sensor perception range and is susceptible to occlusions. In this work, we propose Neural Map Prior (NMP), a neural representation of global maps that facilitates automatic global map updates and improves local map inference performance. To incorporate the strong map prior into local map inference, we employ cross-attention that dynamically captures correlations between current features and prior features. For updating the global neural map prior, we use a learning-based fusion module to guide the network in fusing features from previous traversals. This design allows the network to capture a global neural map prior during sequential online map predictions. Experimental results on the nuScenes dataset demonstrate that our framework is highly compatible with various map segmentation and detection architectures and considerably strengthens map prediction performance, even under adverse weather conditions and across longer horizons. To the best of our knowledge, this represents the first learning-based system for constructing a global map prior.

* CVPR 2023 Camera Ready
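
The cross-attention step, in which current features query the stored prior, can be illustrated with a single-head scaled dot-product attention. This sketch omits the learned query, key, and value projections a real model would use, and the residual connection and function name are assumptions rather than details taken from the paper.

```python
import numpy as np


def cross_attention(current: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Fuse prior map features into current features.

    current: (Nq, d) features from the current observation (queries).
    prior:   (Nk, d) features from the stored map prior (keys and values).
    """
    d = current.shape[-1]
    scores = current @ prior.T / np.sqrt(d)  # (Nq, Nk) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over prior cells
    return current + weights @ prior  # residual: current features refined by the prior
```

Each current feature receives a weighted mix of prior features, with weights that depend on how well the two match, which is one way to "dynamically capture correlations" between them.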

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Mar 30, 2023
Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, Song Han

High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to translate into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT that revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply the evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.

* CVPR 2023. The first two authors contributed equally to this work. Project page: https://sparsevit.mit.edu
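
Window activation pruning can be sketched as scoring each window and keeping only the top fraction. The abstract does not state the importance criterion; the L2 activation magnitude used here is an assumption, as are the function name and the flattened-window input layout.

```python
import math
from typing import List, Sequence


def select_windows(
    window_activations: Sequence[Sequence[float]], sparsity: float
) -> List[int]:
    """Return sorted indices of the windows to keep at the given sparsity ratio."""
    n = len(window_activations)
    keep = max(1, round(n * (1.0 - sparsity)))  # always keep at least one window
    # Importance score: L2 norm of each window's flattened activations (assumed).
    scores = [math.sqrt(sum(a * a for a in w)) for w in window_activations]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Since attention is computed per window, the kept windows can be gathered into a smaller dense batch, which is why this kind of pruning can yield real latency savings rather than only fewer theoretical operations.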
