Chengjie Zhang

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Jun 17, 2026

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

Abstract: Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

* Accepted to IROS 2026

VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

May 28, 2026

Dehao Huang, Aoxiang Gu, Chengjie Zhang, Bolin Zou, Wenlong Dong, Zilang Cen, Yue Wang, Hong Zhang

Abstract: Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address these issues, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.

* 11 pages, 7 figures

Easy-IIL: Reducing Human Operational Burden in Interactive Imitation Learning via Assistant Experts

Mar 13, 2026

Chengjie Zhang, Chao Tang, Wenlong Dong, Dehao Huang, Aoxiang Gu, Hong Zhang

Abstract: Interactive Imitation Learning (IIL) typically relies on extensive human involvement for both offline demonstration and online interaction. Prior work primarily focuses on reducing human effort in passive monitoring rather than active operation. Interestingly, structured model-based imitation approaches achieve comparable performance with significantly fewer demonstrations than end-to-end imitation learning policies in the low-data regime. However, these methods are typically surpassed by end-to-end policies as the data increases. Leveraging this insight, we propose Easy-IIL, a framework that utilizes off-the-shelf model-based imitation methods as an assistant expert to replace active human operation for the majority of data collection. The human expert only provides a single demonstration to initialize the assistant expert and intervenes in critical states where the task is approaching failure. Furthermore, Easy-IIL can maintain IIL performance by preserving both offline and online data quality. Extensive simulation and real-world experiments demonstrate that Easy-IIL significantly reduces human operational burden while maintaining performance comparable to mainstream IIL baselines. User studies further confirm that Easy-IIL reduces subjective workload on the human expert. Project page: https://sites.google.com/view/easy-iil

Calibration of Multiple Asynchronous Microphone Arrays using Hybrid TDOA

Feb 10, 2025

Chengjie Zhang, Wenda Pan, Xinyang Han, He Kong

Abstract: Accurate calibration of acoustic sensing systems made of multiple asynchronous microphone arrays is essential for satisfactory performance in sound source localization and tracking. State-of-the-art calibration methods for this type of system rely on the time difference of arrival and direction of arrival measurements among the microphone arrays (denoted as TDOA-M and DOA, respectively). In this paper, to enhance calibration accuracy, we propose to incorporate the time difference of arrival measurements between adjacent sound events (TDOA-S) with respect to the microphone arrays. More specifically, we propose a two-stage calibration approach, including an initial value estimation (IVE) procedure and a final joint optimization step. The IVE stage first initializes all parameters except for microphone array orientations, using hybrid TDOA (i.e., TDOA-M and TDOA-S), odometer data from a moving robot carrying a speaker, and DOA. Subsequently, microphone orientations are estimated through the iterative closest point method. The final joint optimization step estimates multiple microphone array locations, orientations, time offsets, clock drift rates, and sound source locations simultaneously. Both simulation and experimental results show that for scenarios with low or moderate TDOA noise levels, our approach outperforms existing methods in terms of accuracy. All code and data are available at https://github.com/AISLABsustech/Hybrid-TDOA-Multi-Calib.

* Accepted for presentation at ICASSP 2025

Optimal Sensor Placement for TDOA-Based Source Localization with Sensor Location Errors

Oct 28, 2024

Chengjie Zhang, Xinyang Han

Abstract: The accuracy of time difference of arrival (TDOA)-based source localization depends on how the sensors are placed. Many studies focus on optimal sensor placement (OSP) for TDOA-based localization without sensor location noises (OSP-WSLN). In practice, sensor locations are subject to errors caused by, for example, installation deviations, which makes it necessary to study OSP under sensor location noises (OSP-SLN). Two fundamental problems arise: What is the OSP-SLN strategy? To what extent do sensor location errors affect the performance of OSP-SLN? For the first problem, under the assumptions of a near-field source and the full set of TDOA measurements, minimizing the trace of the Cramér-Rao bound (CRB) is used as the optimization criterion. Based on this, a concise equality, namely Eq. (18), is proven to show that OSP-SLN is equivalent to OSP-WSLN. Extensive simulations validate both the equality and the equivalence, and answer the second problem: sensor position errors that are not large have a negligible negative impact on the performance of OSP-SLN, as quantified by the trace of the CRB. Simulations also show that source localization accuracy with OSP-SLN exceeds that with random placement. These simulations validate our derived OSP-SLN and its effectiveness. We have open-sourced the code for community use.
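
The trace-of-CRB criterion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' released code and it covers only the case without sensor location errors. It assumes a 2D geometry, range-difference measurements taken relative to sensor 0, and independent Gaussian range noise with standard deviation `sigma`; the helper names `crb_trace` and `ring` are hypothetical.

```python
import numpy as np

def crb_trace(sensors, source, sigma=1.0):
    """Trace of the CRB on the source position for TDOA localization.

    Assumptions (not from the paper): sensor positions are known exactly,
    measurements are range differences relative to sensor 0, and each
    underlying range measurement has independent noise of std `sigma`.
    """
    sensors = np.asarray(sensors, dtype=float)
    source = np.asarray(source, dtype=float)
    diff = source - sensors                                  # (N, 2)
    unit = diff / np.linalg.norm(diff, axis=1, keepdims=True)
    jac = unit[1:] - unit[0]                                 # (N-1, 2) Jacobian
    n = len(sensors) - 1
    # Differencing against a shared reference correlates the noise.
    cov = sigma**2 * (np.eye(n) + np.ones((n, n)))
    fim = jac.T @ np.linalg.solve(cov, jac)                  # Fisher information
    return float(np.trace(np.linalg.inv(fim)))

def ring(angles_deg, radius=1.0):
    """Sensor positions on a circle around the origin (hypothetical helper)."""
    a = np.deg2rad(angles_deg)
    return radius * np.column_stack([np.cos(a), np.sin(a)])

uniform = crb_trace(ring([0, 90, 180, 270]), source=[0.0, 0.0])
clustered = crb_trace(ring([0, 10, 20, 30]), source=[0.0, 0.0])
print(f"uniform: {uniform:.3f}, clustered: {clustered:.3f}")
```

Under these assumptions, sensors spread evenly around the source give a much smaller CRB trace than sensors clustered on a short arc, which is the kind of comparison a placement criterion of this form is meant to capture.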

Asynchronous Microphone Array Calibration using Hybrid TDOA Information

Mar 12, 2024

Chengjie Zhang, Jiang Wang, He Kong

Abstract: Asynchronous microphone array calibration is a prerequisite for most robot audition applications. In practice, the calibration requires estimating microphone positions, time offsets, clock drift rates, and sound event locations simultaneously. The existing method uses Graph-based Simultaneous Localisation and Mapping (Graph-SLAM) with the common TDOA, i.e., the time difference of arrival between two microphones (TDOA-M), and odometry measurements; however, it depends heavily on the initial value. In this paper, we propose a novel TDOA, the time difference of arrival between adjacent sound events (TDOA-S), combine it with TDOA-M to form what we call hybrid TDOA, and add odometry measurements to construct a Graph-SLAM problem that we solve with the Gauss-Newton (GN) method. TDOA-S is simple and efficient because it eliminates the time offset without introducing new variables. Simulation and real-world experiment results consistently show that our method is independent of the number of microphones, insensitive to initial values, and has better calibration accuracy and stability under various TDOA noise levels. In addition, the simulation results demonstrate that our method has a lower Cramér-Rao lower bound (CRLB) for microphone parameters, which explains the advantages of our method.
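
The claim that TDOA-S removes the time offset can be checked with a small numerical sketch. This is an illustration under simplifying assumptions, not the paper's measurement model: clock drift is ignored, arrival times are noise-free, and the geometry, emission times, and function names (`arrival_times`, `hybrid_tdoa`) are made up for the example.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def arrival_times(mics, offsets, sources, emit_times):
    """Arrival time of event k at microphone i, including a per-mic offset.

    Simplified model (assumption): emission time + propagation delay + offset,
    with no clock drift and no measurement noise.
    """
    dist = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=2)
    return emit_times[None, :] + dist / C + offsets[:, None]

def hybrid_tdoa(t):
    """Return (TDOA-M, TDOA-S) from an arrival-time matrix t[i, k]."""
    tdoa_m = t[1:, :] - t[0, :]      # same event, microphone i vs. microphone 0
    tdoa_s = t[:, 1:] - t[:, :-1]    # same microphone, adjacent sound events
    return tdoa_m, tdoa_s

mics = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sources = np.array([[2.0, 1.0], [2.5, 1.5], [3.0, 1.0]])
emit_times = np.array([0.0, 1.0, 2.0])

# Two different sets of unknown time offsets for the same geometry.
t_a = arrival_times(mics, np.array([0.0, 0.02, -0.05]), sources, emit_times)
t_b = arrival_times(mics, np.array([0.0, 0.30, 0.10]), sources, emit_times)
m_a, s_a = hybrid_tdoa(t_a)
m_b, s_b = hybrid_tdoa(t_b)

print("TDOA-S unchanged by offsets:", np.allclose(s_a, s_b))
print("TDOA-M unchanged by offsets:", np.allclose(m_a, m_b))
```

In this simplified model the per-microphone offset cancels when two arrival times at the same microphone are subtracted, so TDOA-S is identical for both offset sets, while TDOA-M still carries the offset difference between microphones.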

MotionTrack: End-to-End Transformer-based Multi-Object Tracing with LiDAR-Camera Fusion

Jun 29, 2023

Ce Zhang, Chengjie Zhang, Yiluan Guo, Lingji Chen, Michael Happold

Abstract: Multiple Object Tracking (MOT) is crucial to autonomous vehicle perception. End-to-end transformer-based algorithms, which detect and track objects simultaneously, show great potential for the MOT task. However, most existing methods focus on image-based tracking with a single object category. In this paper, we propose an end-to-end transformer-based MOT algorithm (MotionTrack) with multi-modality sensor inputs to track objects with multiple classes. Our objective is to establish a transformer baseline for MOT in an autonomous driving environment. The proposed algorithm consists of a transformer-based data association (DA) module and a transformer-based query enhancement module to achieve MOT and Multiple Object Detection (MOD) simultaneously. MotionTrack and its variations achieve better results (AMOTA score of 0.55) on the nuScenes dataset than other classical baseline models, such as AB3DMOT, CenterTrack, and the probabilistic 3D Kalman filter. In addition, we show that a modified attention mechanism can be used for DA to accomplish MOT, and to aggregate history features to enhance MOD performance.
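
The idea of using attention for data association can be sketched with plain scaled dot-product attention between track embeddings and detection embeddings. This is a generic illustration, not MotionTrack's modified attention module: the function name `attention_association`, the toy embeddings, and the greedy argmax matching are all assumptions made for the example.

```python
import numpy as np

def attention_association(track_queries, det_keys):
    """Soft association weights between tracks (queries) and detections (keys).

    Uses standard scaled dot-product attention with a row-wise softmax.
    The greedy argmax match is a simplification and does not enforce a
    one-to-one assignment.
    """
    d = track_queries.shape[1]
    scores = track_queries @ det_keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights, weights.argmax(axis=1)

# Toy embeddings: two existing tracks, three new detections.
tracks = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0]])
dets = np.array([[0.0, 3.9, 0.1],
                 [4.1, 0.0, 0.0],
                 [0.0, 0.0, 4.0]])

weights, match = attention_association(tracks, dets)
print(match)  # index of the detection each track attends to most
```

Each row of `weights` is a distribution over detections for one track, so the same matrix could also be used to aggregate detection features into the track representation instead of taking a hard match.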

* Accepted to CVPR WAD 2023
