Jianwei Zhang

PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model

Oct 15, 2024

Shang-Ching Liu, Van Nhiem Tran, Wenkai Chen, Wei-Lun Cheng, Yen-Lin Huang, I-Bin Liao, Yung-Hui Li, Jianwei Zhang

Abstract:Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world. Although Visual Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 models to generate refined context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D-AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods for both full and partial point clouds, particularly excelling in its generalization to novel open-world affordance tasks of 3D objects. For more information, visit our project site: pavlm-source.github.io.

Rethinking Weak-to-Strong Augmentation in Source-Free Domain Adaptive Object Detection

Oct 07, 2024

Jiuzheng Yang, Song Tang, Yangkuiyi Zhang, Shuaifeng Li, Mao Ye, Jianwei Zhang, Xiatian Zhu

Abstract:Source-Free domain adaptive Object Detection (SFOD) aims to transfer a detector (pre-trained on a source domain) to new unlabelled target domains. Current SFOD methods typically follow the Mean Teacher framework, where weak-to-strong augmentation provides diverse and sharp contrast for self-supervised learning. However, this augmentation strategy suffers from an inherent problem called crucial semantics loss: due to random, strong disturbance, strong augmentation is prone to losing typical visual components, hindering cross-domain feature extraction. To address this thus-far ignored limitation, this paper introduces a novel Weak-to-Strong Contrastive Learning (WSCoL) approach. The core idea is to distill semantics-lossless knowledge in the weak features (from the weak/teacher branch) to guide the representation learning upon the strong features (from the strong/student branch). To achieve this, we project the original features into a shared space using a mapping network, thereby reducing the bias between the weak and strong features. Meanwhile, weak-feature-guided contrastive learning is performed alternately in a weak-to-strong manner. Specifically, we first conduct an adaptation-aware prototype-guided clustering on the weak features to generate pseudo labels for corresponding strong features matched through proposals. Subsequently, we identify positive-negative samples based on the pseudo labels and perform cross-category contrastive learning on the strong features, where an uncertainty estimator encourages adaptive background contrast. Extensive experiments demonstrate that WSCoL yields new state-of-the-art performance, offering a built-in mechanism that mitigates crucial semantics loss in the traditional Mean Teacher framework. The code and data will be released soon.
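
For readers unfamiliar with the Mean Teacher framework mentioned in this abstract, the following minimal sketch shows its usual teacher update, in which the teacher's weights track an exponential moving average of the student's weights. It is not the WSCoL method; the function name `ema_update`, the dictionary-of-floats parameter layout, and the momentum value are assumptions chosen for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an exponential moving average.

    `teacher` and `student` are hypothetical dicts mapping parameter
    names to floats; real detectors would hold tensors instead.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example: one scalar "weight" per network.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    # The student would normally be trained on strongly augmented inputs
    # between updates; here it is kept fixed to isolate the EMA step.
    teacher = ema_update(teacher, student, momentum=0.9)
```

With a fixed student, the teacher weight decays geometrically toward the student weight, which is why the teacher changes more slowly and gives steadier pseudo labels.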

Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

Sep 30, 2024

Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou

Abstract:Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on observing activations from large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to eliminate channel-wise outliers by smoothing activations with channel-wise maximums during runtime. The rotation operation can narrow the gap between spike outliers and normal values, alleviating the effect of victims caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
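
The channel-wise smoothing step described in this abstract can be illustrated with a toy sketch. It assumes symmetric per-token INT4 quantization with integer codes in [-7, 7] and uses hypothetical helper names (`quantize_int4`, `runtime_smooth`, `count_zero_codes`). It omits the rotation operation and the way scales are applied inside the matrix-multiplication kernel, so it is not the authors' implementation.

```python
def quantize_int4(row):
    """Symmetric per-token INT4 quantization; returns (integer codes, scale)."""
    scale = max(abs(v) for v in row) / 7 or 1.0
    return [max(-7, min(7, round(v / scale))) for v in row], scale


def runtime_smooth(acts):
    """Divide every channel by its maximum |value| over the current tokens."""
    n_ch = len(acts[0])
    s = [max(abs(row[c]) for row in acts) or 1.0 for c in range(n_ch)]
    return [[row[c] / s[c] for c in range(n_ch)] for row in acts], s


def count_zero_codes(acts):
    """Number of values whose INT4 code is 0, i.e. whose signal is lost."""
    return sum(code == 0 for row in acts for code in quantize_int4(row)[0])


# Toy activations: 3 tokens x 4 channels; channel 2 is a channel-wise outlier.
acts = [
    [0.10, 0.20, 50.0, -0.30],
    [-0.20, 0.10, 48.0, 0.25],
    [0.15, -0.10, 52.0, 0.05],
]
smoothed, s = runtime_smooth(acts)
zeros_direct = count_zero_codes(acts)        # normal channels collapse to 0
zeros_smoothed = count_zero_codes(smoothed)  # every channel keeps a nonzero code
```

In this toy case, direct quantization maps every normal-channel value to code 0 because the outlier channel sets the scale, whereas the smoothed activations keep a nonzero code in every channel. The per-channel scales `s` must be multiplied back in later. Whether end-to-end error improves depends on the weights, which this sketch does not model.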

ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation

Sep 13, 2024

Kaixin Bai, Huajian Zeng, Lei Zhang, Yiwen Liu, Hongli Xu, Zhaopeng Chen, Jianwei Zhang

Abstract:Transparent object depth perception poses a challenge in everyday life and logistics, primarily due to the inability of standard 3D sensors to accurately capture depth on transparent or reflective surfaces. This limitation significantly affects depth map and point cloud-reliant applications, especially in robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which enhances the accuracy of depth recovery using structural features in images. To address the high costs associated with dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm. Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at https://sites.google.com/view/cleardepth/ .

* 7 pages, 7 figures

Subequivariant Reinforcement Learning in 3D Multi-Entity Physical Environments

Jul 17, 2024

Runfa Chen, Ling Wang, Yu Du, Tianrui Xue, Fuchun Sun, Jianwei Zhang, Wenbing Huang

Abstract:Learning policies for multi-entity systems in 3D environments is far more complicated than in single-entity scenarios, due to the exponential expansion of the global state space as the number of entities increases. One potential way to alleviate the exponential complexity is to divide the global space into independent local views that are invariant to transformations including translations and rotations. To this end, this paper proposes Subequivariant Hierarchical Neural Networks (SHNN) to facilitate multi-entity policy learning. In particular, SHNN first dynamically decouples the global space into local entity-level graphs via task assignment. Second, it leverages subequivariant message passing over the local entity-level graphs to devise local reference frames, remarkably compressing the representation redundancy, particularly in gravity-affected environments. Furthermore, to overcome the limitations of existing benchmarks in capturing the subtleties of multi-entity systems under the Euclidean symmetry, we propose the Multi-entity Benchmark (MEBEN), a new suite of environments tailored for exploring a wide range of multi-entity reinforcement learning. Extensive experiments demonstrate significant advancements of SHNN on the proposed benchmarks compared to existing methods. Comprehensive ablations are conducted to verify the indispensability of task assignment and subequivariance.
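
The idea of local views that are invariant to translations and rotations can be made concrete with a small sketch. It is not SHNN. It only checks numerically that centroid-relative heights and pairwise distances do not change when a set of points is translated and rotated about the vertical axis, which is assumed here to be the gravity direction. All function names are hypothetical.

```python
import math


def rotate_about_z(point, theta):
    """Rotate a 3D point about the z axis (assumed gravity direction)."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def local_view(points):
    """Translation-invariant view: coordinates relative to the centroid."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in points]


def invariant_features(points):
    """Pairwise distances plus centroid-relative heights."""
    local = local_view(points)
    dists = [math.dist(a, b) for i, a in enumerate(local) for b in local[i + 1:]]
    heights = [p[2] for p in local]
    return dists + heights


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 2.0, 1.0)]
shift = (3.0, -1.0, 2.0)
moved = [
    tuple(c + d for c, d in zip(rotate_about_z(p, 0.7), shift)) for p in points
]
before = invariant_features(points)
after = invariant_features(moved)
```

The two feature lists agree up to floating-point error, so a policy that consumes only such features sees the same input for every translated or gravity-aligned rotated copy of a scene.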

* ICML 2024

Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation

Jul 17, 2024

Kaixin Bai, Lei Zhang, Zhaopeng Chen, Fang Wan, Jianwei Zhang

Abstract:Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both RGB and physically realistic depth images, surpassing previous dataset generation tools. We create an RGBD dataset tailored for robotic industrial grasping scenarios and evaluate it across various tasks, including object detection, instance segmentation, and embedding sim2real visual perception in industrial robotic grasping. By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings. Project details are available at https://baikaixinpublic.github.io/structured light 3D synthesizer/.

* 7 pages, 2024 IEEE International Conference on Robotics and Automation

Qwen2 Technical Report

Jul 16, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang(+49 more)

Abstract:This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

* 25 pages, 1 figure

Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

Jun 26, 2024

Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu

Abstract:Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and relatively simple backgrounds, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.

RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

Jun 14, 2024

Hantao Zhou, Tianying Ji, Jianwei Zhang, Fuchun Sun, Huazhe Xu

Abstract:Minigolf, a game with countless court layouts and complex ball motion, constitutes a compelling real-world testbed for the study of embodied intelligence, as it not only challenges spatial and kinodynamic reasoning but also requires reflective and corrective capacities to address erroneously designed courses. We introduce RoboGolf, a framework that perceives dual-camera visual inputs with nested VLM-empowered closed-loop control and a reflective equilibrium loop. Extensive experiments demonstrate the effectiveness of RoboGolf on challenging minigolf courts, including those that are impossible to finish.

* Project page: https://jity16.github.io/RoboGolf/

Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Jun 07, 2024

Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao(+15 more)

Abstract:The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase the waiting time for GPUs. To address these challenges, this paper introduces a communication-driven solution, namely C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
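
The first insight, that homogeneous collective communication makes outliers easy to spot, can be sketched with a simple median comparison. This is an illustrative assumption rather than C4's actual detection logic. The function `find_stragglers`, the worker names, the timings, and the 1.5x tolerance are all hypothetical.

```python
from statistics import median


def find_stragglers(durations_ms, tolerance=1.5):
    """Return worker ids whose collective-communication time exceeds
    `tolerance` times the median across workers.

    Because every worker performs the same operation in the same step,
    a worker far above the median is a candidate for a hardware fault.
    """
    typical = median(durations_ms.values())
    return sorted(w for w, d in durations_ms.items() if d > tolerance * typical)


# Hypothetical per-worker timings (milliseconds) for one all-reduce step.
durations = {"gpu0": 10.2, "gpu1": 10.4, "gpu2": 31.0, "gpu3": 10.1}
suspects = find_stragglers(durations)
```

Here only `gpu2` exceeds 1.5 times the median of 10.3 ms, so it is the only worker flagged for inspection.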
