Abstract: Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs, coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we fine-tune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a large-scale hybrid dataset sourced from both a high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, which highlights the importance of understanding physical contact for robust, touch-aware robotic agents.
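
The abstract does not give the form of the HSA loss, so the following is only a generic illustration of what "aligning tactile tokens with their spatial counterparts" could look like, not the authors' formulation. It assumes that each tactile token has already been matched to one visual token (for example, the image patch where the fingertip appears), and it penalizes the cosine distance between matched embeddings. The function names, the `matches` list, and the choice of cosine distance are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(tactile_tokens, visual_tokens, matches):
    """Mean cosine distance over matched (tactile index, visual index) pairs.

    Hypothetical stand-in for a spatial alignment objective: the loss is 0
    when every matched pair points in the same direction and grows as the
    matched embeddings diverge.
    """
    total = 0.0
    for tactile_idx, visual_idx in matches:
        sim = cosine_similarity(tactile_tokens[tactile_idx], visual_tokens[visual_idx])
        total += 1.0 - sim
    return total / len(matches)


# Toy usage: two tactile tokens, each matched to one wrist-view token.
tactile = [[3.0, 4.0], [1.0, 0.0]]
wrist = [[3.0, 4.0], [0.0, 1.0]]
loss = alignment_loss(tactile, wrist, matches=[(0, 0), (1, 1)])
```

In this toy case the first pair is perfectly aligned and the second is orthogonal, so the loss is the average of the two distances. A hierarchical variant would presumably apply a term of this kind at both the wrist-camera and third-person scales, but that detail is not stated in the abstract.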




Abstract: Perception in robot manipulation has been actively explored with the goal of advancing and integrating vision and touch for global and local feature extraction. However, certain internal states of an object are difficult to perceive, and the integration of visual and haptic perception is not compact and is easily biased. We propose to address these limitations by developing an active acoustic sensing method for robot manipulation. Active acoustic sensing relies on the resonant properties of the object, which are related to its material, shape, internal structure, and contact interactions with the gripper and environment. The sensor consists of a vibration actuator paired with a piezoelectric microphone. The actuator generates a waveform, and the microphone tracks the waveform's propagation and distortion as it travels through the object. As a proof of concept, this paper presents the sensing principles, hardware design, simulation development, and evaluation of physical and simulated sensory data under different conditions. This work aims to provide the fundamentals of active acoustic sensing as a tool for downstream robot manipulation tasks such as object recognition, grasping point estimation, object pose estimation, and external contact formation detection.
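
The sensing principle described above (excite the object, then observe how the waveform is distorted by its resonances) can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the paper's simulator: the object is modeled as a single two-pole digital resonator, the actuator emits pure sine tones one frequency at a time, and the "microphone" simply records the steady-state output amplitude. The function names, sampling rate, and resonator parameters are all hypothetical.

```python
import math


def resonator_response(freq_hz, f0_hz, q, fs_hz=8000.0, n_samples=4000):
    """Steady-state output amplitude of a two-pole resonator driven by a sine.

    The resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2] stands in for an object
    with one resonant mode at f0_hz; q controls how sharp the resonance is.
    """
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    r = 1.0 - w0 / (2.0 * q)  # pole radius; closer to 1 means less damping
    a1 = 2.0 * r * math.cos(w0)
    a2 = -r * r
    y1 = y2 = 0.0
    peak = 0.0
    for n in range(n_samples):
        x = math.sin(2.0 * math.pi * freq_hz * n / fs_hz)  # actuator tone
        y = x + a1 * y1 + a2 * y2                          # object response
        y2, y1 = y1, y
        if n >= n_samples // 2:  # skip the start-up transient
            peak = max(peak, abs(y))
    return peak


def estimate_resonance(f0_hz, q, sweep_hz):
    """Return the swept frequency with the largest measured response."""
    return max(sweep_hz, key=lambda f: resonator_response(f, f0_hz, q))


# Toy usage: sweep 200 Hz to 1000 Hz in 20 Hz steps over two "objects".
sweep = list(range(200, 1001, 20))
soft_object_peak = estimate_resonance(440.0, 20.0, sweep)
stiff_object_peak = estimate_resonance(600.0, 20.0, sweep)
```

The two simulated objects differ only in their resonant frequency, and the sweep recovers a different peak for each. This is the sense in which a measured response can carry information about material, shape, or contact state. A real object has many coupled modes, and a practical system would more likely use a chirp or broadband excitation than a tone-by-tone sweep.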