David Hsu

NUS

10 Open Challenges Steering the Future of Vision-Language-Action Models

Nov 08, 2025

Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu

Abstract:Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models into wider acceptability.

* AAAI 2026 (Senior Track)

Robot Operation of Home Appliances by Reading User Manuals

May 26, 2025

Jian Zhang, Hanbo Zhang, Anxing Xiao, David Hsu

Abstract:Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by "reading" their user manuals. ApBot faces multiple challenges: (i) infer goal-conditioned partial policies from their unstructured, textual descriptions in a user manual document, (ii) ground the policies to the appliance in the physical world, and (iii) execute the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that structured internal representations play an important role in robust robot operation of home appliances, especially complex ones.

FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation

Feb 17, 2025

Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, Hong Zhang

Abstract:Learning tool use from a single human demonstration video offers a highly intuitive and efficient approach to robot teaching. While humans can effortlessly generalize a demonstrated tool manipulation skill to diverse tools that support the same function (e.g., pouring with a mug versus a teapot), current one-shot imitation learning (OSIL) methods struggle to achieve this. A key challenge lies in establishing functional correspondences between demonstration and test tools, considering significant geometric variations among tools with the same function (i.e., intra-function variations). To address this challenge, we propose FUNCTO (Function-Centric OSIL for Tool Manipulation), an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation, enabling robots to generalize tool manipulation skills from a single human demonstration video to novel tools with the same function despite significant intra-function variations. With this formulation, we factorize FUNCTO into three stages: (1) functional keypoint extraction, (2) function-centric correspondence establishment, and (3) functional keypoint-based action planning. We evaluate FUNCTO against existing modular OSIL methods and end-to-end behavioral cloning methods through real-robot experiments on diverse tool manipulation tasks. The results demonstrate the superiority of FUNCTO when generalizing to novel tools with intra-function geometric variations. More details are available at https://sites.google.com/view/functo.

Robi Butler: Remote Multimodal Interactions with Household Robot Assistant

Sep 30, 2024

Anxing Xiao, Nuwan Janaka, Tianrun Hu, Anshul Gupta, Kaixin Li, Cunjun Yu, David Hsu

Abstract:In this paper, we introduce Robi Butler, a novel household robotic system that enables multimodal interactions with remote users. Building on the advanced communication interfaces, Robi Butler allows users to monitor the robot's status, send text or voice instructions, and select target objects by hand pointing. At the core of our system is a high-level behavior module, powered by Large Language Models (LLMs), that interprets multimodal instructions to generate action plans. These plans are composed of a set of open vocabulary primitives supported by Vision Language Models (VLMs) that handle both text and pointing queries. The integration of the above components allows Robi Butler to ground remote multimodal instructions in the real-world home environment in a zero-shot manner. We demonstrate the effectiveness and efficiency of this system using a variety of daily household tasks that involve remote users giving multimodal instructions. Additionally, we conducted a user study to analyze how multimodal interactions affect efficiency and user experience during remote human-robot interaction and discuss the potential improvements.

Stable Object Placement Under Geometric Uncertainty via Differentiable Contact Dynamics

Sep 26, 2024

Linfeng Li, Gang Yang, Lin Shao, David Hsu

Abstract:From serving a cup of coffee to carefully rearranging delicate items, stable object placement is a crucial skill for future robots. This skill is challenging due to the required accuracy, which is difficult to achieve under geometric uncertainty. We leverage differentiable contact dynamics to develop a principled method for stable object placement under geometric uncertainty. We estimate the geometric uncertainty by minimizing the discrepancy between the force-torque sensor readings and the model predictions through gradient descent. We further keep track of a belief over multiple possible geometric parameters to mitigate the gradient-based method's sensitivity to the initialization. We verify our approach in the real world on various geometric uncertainties, including the in-hand pose uncertainty of the grasped object, the object's shape uncertainty, and the environment's shape uncertainty.

General-purpose Clothes Manipulation with Semantic Keypoints

Aug 15, 2024

Yuhong Deng, David Hsu

Abstract:We have seen much recent progress in task-specific clothes manipulation, but generalizable clothes manipulation is still a challenge. Clothes manipulation requires sequential actions, making it challenging to generalize to unseen tasks. Besides, a general clothes state representation method is crucial. In this paper, we adopt language instructions to specify and decompose clothes manipulation tasks, and propose a large language model based hierarchical learning method to enhance generalization. For state representation, we use semantic keypoints to capture the geometry of clothes and outline their manipulation methods. Simulation experiments show that the proposed method outperforms the baseline method in terms of success rate and generalization for clothes manipulation tasks.

TOM: A Development Platform For Wearable Intelligent Assistants

Jul 22, 2024

Nuwan Janaka, Shengdong Zhao, David Hsu, Sherisse Tan Jing Wen, Koh Chun Keat

Abstract:Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. This system was developed collaboratively with AR/MR researchers, HCI researchers, AI/Robotic researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.

* UbiComp Companion 2024
* 14 pages, 6 figures, 2 tables

IntentionNet: Map-Lite Visual Navigation at the Kilometre Scale

Jul 03, 2024

Wei Gao, Bo Ai, Joel Loo, Vinay, David Hsu

Abstract:This work explores the challenges of creating a scalable and robust robot navigation system that can traverse both indoor and outdoor environments to reach distant goals. We propose a navigation system architecture called IntentionNet that employs a monolithic neural network as the low-level planner/controller, and uses a general interface that we call intentions to steer the controller. The paper proposes two types of intentions, Local Path and Environment (LPE) and Discretised Local Move (DLM), and shows that DLM is robust to significant metric positioning and mapping errors. The paper also presents Kilo-IntentionNet, an instance of the IntentionNet system using the DLM intention that is deployed on a Boston Dynamics Spot robot, and which successfully navigates through complex indoor and outdoor environments over distances of up to a kilometre with only noisy odometry.

Open Scene Graphs for Open World Object-Goal Navigation

Jul 02, 2024

Joel Loo, Zhanxin Wu, David Hsu

Abstract:How can we build robots for open-world semantic navigation tasks, like searching for target objects in novel scenes? While foundation models have the rich knowledge and generalisation needed for these tasks, a suitable scene representation is needed to connect them into a complete robot system. We address this with Open Scene Graphs (OSGs), a topo-semantic representation that retains and organises open-set scene information for these models, and has a structure that can be configured for different environment types. We integrate foundation models and OSGs into the OpenSearch system for Open World Object-Goal Navigation, which is capable of searching for open-set objects specified in natural language, while generalising zero-shot across diverse environments and embodiments. Our OSGs enhance reasoning with Large Language Models (LLMs), enabling robust object-goal navigation outperforming existing LLM approaches. Through simulation and real-world experiments, we validate OpenSearch's generalisation across varied environments, robots and novel instructions.

"Set It Up!": Functional Object Arrangement with Compositional Generative Models

May 20, 2024

Yiqing Xu, Jiayuan Mao, Yilun Du, Tomás Lozano-Pérez, Leslie Pack Kaelbling, David Hsu

Abstract:This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.

* 10 pages main paper, 21 pages appendix, RSS 2024
