Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ruisen Tu

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Mar 24, 2026

Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Abstract:Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation quality, we co-train auxiliary decoders that reconstruct interpretable intermediate signals - including global robot position, joint configurations, grasp affordances, target-object relative pose, and segmentation masks - from shared visual-language features. These objectives provide dense supervision that encourages the backbone to develop spatially grounded, manipulation-aware latent representations. Through extensive evaluation on home rearrangement tasks, our approach achieves consistent improvements across picking, placing, opening, and closing operations, substantially outperforming direct imitation learning. Our findings suggest that spatial grounding through auxiliary and multi-modal learning provides a strong direction for scaling VLA models toward general-purpose domestic robots.

Via

Access Paper or Ask Questions

3D Hand Pose Estimation in Egocentric Images in the Wild

Dec 11, 2023

Aditya Prakash, Ruisen Tu, Matthew Chang, Saurabh Gupta

Figure 1 for 3D Hand Pose Estimation in Egocentric Images in the Wild

Figure 2 for 3D Hand Pose Estimation in Egocentric Images in the Wild

Figure 3 for 3D Hand Pose Estimation in Egocentric Images in the Wild

Figure 4 for 3D Hand Pose Estimation in Egocentric Images in the Wild

Abstract:We present WildHands, a method for 3D hand pose estimation in egocentric images in the wild. This is challenging due to (a) lack of 3D hand pose annotations for images in the wild, and (b) a form of perspective distortion-induced shape ambiguity that arises in the analysis of crops around hands. For the former, we use auxiliary supervision on in-the-wild data in the form of segmentation masks & grasp labels in addition to 3D supervision available in lab datasets. For the latter, we provide spatial cues about the location of the hand crop in the camera's field of view. Our approach achieves the best 3D hand pose on the ARCTIC leaderboard and outperforms FrankMocap, a popular and robust approach for estimating hand pose in the wild, by 45.3% when evaluated on 2D hand pose on our EPIC-HandKps dataset.

* Project page: https://ap229997.github.io/projects/hands/

Via

Access Paper or Ask Questions

Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions

Jan 19, 2023

Junyang Cai, Khai-Nguyen Nguyen, Nishant Shrestha, Aidan Good, Ruisen Tu, Xin Yu, Shandian Zhe, Thiago Serra

Figure 1 for Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions

Figure 2 for Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions

Figure 3 for Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions

Figure 4 for Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions

Abstract:One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. This drop plausibly reflects a loss in model complexity, which we aim to avoid. In this work, we explore how sparsity also affects the geometry of the linear regions defined by a neural network, and consequently reduces the expected maximum number of linear regions based on the architecture. We observe that pruning affects accuracy similarly to how sparsity affects the number of linear regions and our proposed bound for the maximum number. Conversely, we find out that selecting the sparsity across layers to maximize our bound very often improves accuracy in comparison to pruning as much with the same sparsity in all layers, thereby providing us guidance on where to prune.

* (Under review)

Via

Access Paper or Ask Questions