Haomeng Zhang

Auto-Vocabulary 3D Object Detection

Dec 18, 2025

Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh

Abstract: Open-vocabulary 3D object detection methods can localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes at both training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce the Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the previous SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

* technical report

Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Oct 29, 2024

Haomeng Zhang, Chiao-An Yang, Raymond A. Yeh

Abstract: Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations: first, a dynamic vision module that enables a variable and learnable number of box proposals; second, a dynamic camera positioning scheme that extracts features for each proposal; third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Experiments show that our method outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.

* NeurIPS 2024

Hyperspherical Embedding for Point Cloud Completion

Jul 11, 2023

Junming Zhang, Haomeng Zhang, Ram Vasudevan, Matthew Johnson-Roberson

Abstract: Most real-world 3D measurements from depth sensors are incomplete. To address this issue, the point cloud completion task aims to predict the complete shapes of objects from partial observations. Previous works often adopt an encoder-decoder architecture, where the encoder is trained to extract embeddings that the decoder uses as inputs to generate predictions. However, the learned embeddings are sparsely distributed in the feature space, which leads to worse generalization at test time. To address this problem, this paper proposes a hyperspherical module, which transforms and normalizes embeddings from the encoder to lie on a unit hypersphere. With the proposed module, the magnitude and direction of the output hyperspherical embedding are decoupled and only the directional information is optimized. We theoretically analyze the hyperspherical embedding and show that it enables more stable training with a wider range of learning rates and more compact embedding distributions. Experimental results show consistent improvement of point cloud completion in both single-task and multi-task learning, which demonstrates the effectiveness of the proposed method.
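The normalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain L2 normalization only; the paper's module also applies a transformation that is not specified here, and the names `to_unit_hypersphere`, `short_vec`, and `long_vec` are hypothetical, chosen for illustration. The sketch shows how projecting onto the unit hypersphere discards magnitude and keeps only direction.

```python
import math


def to_unit_hypersphere(embedding, eps=1e-12):
    """Project an embedding onto the unit hypersphere via L2 normalization.

    Hypothetical helper for illustration; not the authors' implementation.
    `eps` guards against division by zero for an all-zero embedding.
    """
    norm = math.sqrt(sum(x * x for x in embedding))
    return [x / max(norm, eps) for x in embedding]


# Two embeddings pointing in the same direction with different magnitudes.
short_vec = [3.0, 4.0]
long_vec = [30.0, 40.0]

unit_short = to_unit_hypersphere(short_vec)
unit_long = to_unit_hypersphere(long_vec)

# After normalization both have unit length and coincide, so only the
# direction of the original embedding remains.
print(unit_short, unit_long)
```

In a deep learning framework the same operation would typically be applied to a batch of encoder outputs along the feature dimension, so that gradients only change the direction of each embedding.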
