Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

He Wang

University College London

D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

Sep 25, 2024

Songlin Wei, Haoran Geng, Jiayi Chen, Congyue Deng, Wenbo Cui, Chengyang Zhao, Xiaomeng Fang, Leonidas Guibas, He Wang

Figure 1 for D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

Figure 2 for D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

Figure 3 for D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

Figure 4 for D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

Abstract:Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth which bottlenecks robot performances. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporated a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieve state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.

Via

Access Paper or Ask Questions

ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

Sep 21, 2024

Jiangran Lyu, Yuxing Chen, Tao Du, Feng Zhu, Huiquan Liu, Yizhou Wang, He Wang

Figure 1 for ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

Figure 2 for ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

Figure 3 for ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

Figure 4 for ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

Abstract:This paper tackles the challenging robotic task of generalizable paper cutting using scissors. In this task, scissors attached to a robot arm are driven to accurately cut curves drawn on the paper, which is hung with the top edge fixed. Due to the frequent paper-scissor contact and consequent fracture, the paper features continual deformation and changing topology, which is diffult for accurate modeling. To ensure effective execution, we customize an action primitive sequence for imitation learning to constrain its action space, thus alleviating potential compounding errors. Finally, by integrating sim-to-real techniques to bridge the gap between simulation and reality, our policy can be effectively deployed on the real robot. Experimental results demonstrate that our method surpasses all baselines in both simulation and real-world benchmarks and achieves performance comparable to human operation with a single hand under the same conditions.

* Accepted by CoRL2024

Via

Access Paper or Ask Questions

Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Sep 11, 2024

Mohsen Azarmi, Mahdi Rezaei, He Wang, Ali Arabian

Figure 1 for Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Figure 2 for Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Figure 3 for Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Figure 4 for Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review

Abstract:Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability delimits the trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation; thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretability and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions, and potential prediction biases due to the speed feature through cross-context permutation evaluation. We propose an alternative feature representation by considering proximity change rate for rendering dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity to develop accurate and robust intent-predictive models.

Via

Access Paper or Ask Questions

Decoupling Contact for Fine-Grained Motion Style Transfer

Sep 09, 2024

Xiangjun Tang, Linjun Wu, He Wang, Yiqian Wu, Bo Hu, Songnan Li, Xu Gong, Yuchen Liao, Qilong Kou, Xiaogang Jin

Abstract:Motion style transfer changes the style of a motion while retaining its content and is useful in computer animations and games. Contact is an essential component of motion style transfer that should be controlled explicitly in order to express the style vividly while enhancing motion naturalness and quality. However, it is unknown how to decouple and control contact to achieve fine-grained control in motion style transfer. In this paper, we present a novel style transfer method for fine-grained control over contacts while achieving both motion naturalness and spatial-temporal variations of style. Based on our empirical evidence, we propose controlling contact indirectly through the hip velocity, which can be further decomposed into the trajectory and contact timing, respectively. To this end, we propose a new model that explicitly models the correlations between motions and trajectory/contact timing/style, allowing us to decouple and control each separately. Our approach is built around a motion manifold, where hip controls can be easily integrated into a Transformer-based decoder. It is versatile in that it can generate motions directly as well as be used as post-processing for existing methods to improve quality and contact controllability. In addition, we propose a new metric that measures a correlation pattern of motions based on our empirical evidence, aligning well with human perception in terms of motion naturalness. Based on extensive evaluation, our method outperforms existing methods in terms of style expressivity and motion quality.

Via

Access Paper or Ask Questions

TASAR: Transferable Attack on Skeletal Action Recognition

Sep 04, 2024

Yunfeng Diao, Baiqi Wu, Ruixuan Zhang, Ajian Liu, Xingxing Wei, Meng Wang, He Wang

Abstract:Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothening the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HARs. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.

* arXiv admin note: text overlap with arXiv:2407.08572

Via

Access Paper or Ask Questions

The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

Aug 05, 2024

He Wang, Lei Xie

Figure 1 for The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

Figure 2 for The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

Figure 3 for The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

Abstract:This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing Enhanced ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing second place in the open track of the Single-Speaker Task and first place in the other three tracks.

* 2 pages, 2 figures, CNVSRC 2024 System Report

Via

Access Paper or Ask Questions

On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Jul 27, 2024

Xudong Han, Ning Guo, Yu Jie, He Wang, Fang Wan, Chaoyang Song

Figure 1 for On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Figure 2 for On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Figure 3 for On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Figure 4 for On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Abstract:This paper investigates the direct application of standardized designs on the robot for conducting robot hand-eye calibration by employing 3D scanners with collaborative robots. The well-established geometric features of the robot flange are exploited by directly capturing its point cloud data. In particular, an iterative method is proposed to facilitate point cloud processing toward a refined calibration outcome. Several extensive experiments are conducted over a range of collaborative robots, including Universal Robots UR5 & UR10 e-series, Franka Emika, and AUBO i5 using an industrial-grade 3D scanner Photoneo Phoxi S & M and a commercial-grade 3D scanner Microsoft Azure Kinect DK. Experimental results show that translational and rotational errors converge efficiently to less than 0.28 mm and 0.25 degrees, respectively, achieving a hand-eye calibration accuracy as high as the camera's resolution, probing the hardware limit. A welding seam tracking system is presented, combining the flange-based calibration method with soft tactile sensing. The experiment results show that the system enables the robot to adjust its motion in real-time, ensuring consistent weld quality and paving the way for more efficient and adaptable manufacturing processes.

* 25 pages, 14 figures, 2 tables, Accepted by Measurement

Via

Access Paper or Ask Questions

Interactive Rendering of Relightable and Animatable Gaussian Avatars

Jul 15, 2024

Youyi Zhan, Tianjia Shao, He Wang, Yin Yang, Kun Zhou

Figure 1 for Interactive Rendering of Relightable and Animatable Gaussian Avatars

Figure 2 for Interactive Rendering of Relightable and Animatable Gaussian Avatars

Figure 3 for Interactive Rendering of Relightable and Animatable Gaussian Avatars

Figure 4 for Interactive Rendering of Relightable and Animatable Gaussian Avatars

Abstract:Creating relightable and animatable avatars from multi-view or monocular videos is a challenging task for digital human creation and virtual reality applications. Previous methods rely on neural radiance fields or ray tracing, resulting in slow training and rendering processes. By utilizing Gaussian Splatting, we propose a simple and efficient method to decouple body materials and lighting from sparse-view or monocular avatar videos, so that the avatar can be rendered simultaneously under novel viewpoints, poses, and lightings at interactive frame rates (6.9 fps). Specifically, we first obtain the canonical body mesh using a signed distance function and assign attributes to each mesh vertex. The Gaussians in the canonical space then interpolate from nearby body mesh vertices to obtain the attributes. We subsequently deform the Gaussians to the posed space using forward skinning, and combine the learnable environment light with the Gaussian attributes for shading computation. To achieve fast shadow modeling, we rasterize the posed body mesh from dense viewpoints to obtain the visibility. Our approach is not only simple but also fast enough to allow interactive rendering of avatar animation under environmental light changes. Experiments demonstrate that, compared to previous works, our method can render higher quality results at a faster speed on both synthetic and real datasets.

Via

Access Paper or Ask Questions

Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

Jul 11, 2024

Yunfeng Diao, Baiqi Wu, Ruixuan Zhang, Xun Yang, Meng Wang, He Wang

Figure 1 for Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

Figure 2 for Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

Figure 3 for Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

Figure 4 for Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

Abstract:Skeletal motion plays a pivotal role in human activity recognition (HAR). Recently, attack methods have been proposed to identify the universal vulnerability of skeleton-based HAR(S-HAR). However, the research of adversarial transferability on S-HAR is largely missing. More importantly, existing attacks all struggle in transfer across unknown S-HAR models. We observed that the key reason is that the loss landscape of the action recognizers is rugged and sharp. Given the established correlation in prior studies~\cite{qin2022boosting,wu2020towards} between loss landscape and adversarial transferability, we assume and empirically validate that smoothing the loss landscape could potentially improve adversarial transferability on S-HAR. This is achieved by proposing a new post-train Dual Bayesian strategy, which can effectively explore the model posterior space for a collection of surrogates without the need for re-training. Furthermore, to craft adversarial examples along the motion manifold, we incorporate the attack gradient with information of the motion dynamics in a Bayesian manner. Evaluated on benchmark datasets, e.g. HDM05 and NTU 60, the average transfer success rate can reach as high as 35.9\% and 45.5\% respectively. In comparison, current state-of-the-art skeletal attacks achieve only 3.6\% and 9.8\%. The high adversarial transferability remains consistent across various surrogate, victim, and even defense models. Through a comprehensive analysis of the results, we provide insights on what surrogates are more likely to exhibit transferability, to shed light on future research.

Via

Access Paper or Ask Questions

RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Jul 05, 2024

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang

Figure 1 for RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Figure 2 for RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Figure 3 for RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Figure 4 for RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Abstract:This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data, human-object interaction (HOI) data, and custom data to construct a comprehensive affordance memory. Then given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers such out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that our RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at https://yxkryptonite.github.io/RAM/.

Via

Access Paper or Ask Questions