Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nakul Gopalan

PokeNet: Learning Kinematic Models of Articulated Objects from Human Observations

Feb 02, 2026

Anmol Gupta, Weiwei Gu, Omkar Patil, Jun Ki Lee, Nakul Gopalan

Abstract:Articulation modeling enables robots to learn joint parameters of articulated objects for effective manipulation which can then be used downstream for skill learning or planning. Existing approaches often rely on prior knowledge about the objects, such as the number or type of joints. Some of these approaches also fail to recover occluded joints that are only revealed during interaction. Others require large numbers of multi-view images for every object, which is impractical in real-world settings. Furthermore, prior works neglect the order of manipulations, which is essential for many multi-DoF objects where one joint must be operated before another, such as a dishwasher. We introduce PokeNet, an end-to-end framework that estimates articulation models from a single human demonstration without prior object knowledge. Given a sequence of point cloud observations of a human manipulating an unknown object, PokeNet predicts joint parameters, infers manipulation order, and tracks joint states over time. PokeNet outperforms existing state-of-the-art methods, improving joint axis and state estimation accuracy by an average of over 27% across diverse objects, including novel and unseen categories. We demonstrate these gains in both simulation and real-world environments.

Via

Access Paper or Ask Questions

Learning Sequential Kinematic Models from Demonstrations for Multi-Jointed Articulated Objects

May 09, 2025

Anmol Gupta, Weiwei Gu, Omkar Patil, Jun Ki Lee, Nakul Gopalan

Abstract:As robots become more generalized and deployed in diverse environments, they must interact with complex objects, many with multiple independent joints or degrees of freedom (DoF) requiring precise control. A common strategy is object modeling, where compact state-space models are learned from real-world observations and paired with classical planning. However, existing methods often rely on prior knowledge or focus on single-DoF objects, limiting their applicability. They also fail to handle occluded joints and ignore the manipulation sequences needed to access them. We address this by learning object models from human demonstrations. We introduce Object Kinematic Sequence Machines (OKSMs), a novel representation capturing both kinematic constraints and manipulation order for multi-DoF objects. To estimate these models from point cloud data, we present Pokenet, a deep neural network trained on human demonstrations. We validate our approach on 8,000 simulated and 1,600 real-world annotated samples. Pokenet improves joint axis and state estimation by over 20 percent on real-world data compared to prior methods. Finally, we demonstrate OKSMs on a Sawyer robot using inverse kinematics-based planning to manipulate multi-DoF objects.

Via

Access Paper or Ask Questions

Composing Diffusion Policies for Few-shot Learning of Movement Trajectories

Oct 22, 2024

Omkar Patil, Anant Sah, Nakul Gopalan

Figure 1 for Composing Diffusion Policies for Few-shot Learning of Movement Trajectories

Figure 2 for Composing Diffusion Policies for Few-shot Learning of Movement Trajectories

Figure 3 for Composing Diffusion Policies for Few-shot Learning of Movement Trajectories

Figure 4 for Composing Diffusion Policies for Few-shot Learning of Movement Trajectories

Abstract:Humans can perform various combinations of physical skills without having to relearn skills from scratch every single time. For example, we can swing a bat when walking without having to re-learn such a policy from scratch by composing the individual skills of walking and bat swinging. Enabling robots to combine or compose skills is essential so they can learn novel skills and tasks faster with fewer real world samples. To this end, we propose a novel compositional approach called DSE- Diffusion Score Equilibrium that enables few-shot learning for novel skills by utilizing a combination of base policy priors. Our method is based on probabilistically composing diffusion policies to better model the few-shot demonstration data-distribution than any individual policy. Our goal here is to learn robot motions few-shot and not necessarily goal oriented trajectories. Unfortunately we lack a general purpose metric to evaluate the error between a skill or motion and the provided demonstrations. Hence, we propose a probabilistic measure - Maximum Mean Discrepancy on the Forward Kinematics Kernel (MMD-FK), that is task and action space agnostic. By using our few-shot learning approach DSE, we show that we are able to achieve a reduction of over 30% in MMD-FK across skills and number of demonstrations. Moreover, we show the utility of our approach through real world experiments by teaching novel trajectories to a robot in 5 demonstrations.

* 6(+1) pages, 6 figures

Via

Access Paper or Ask Questions

Continual Skill and Task Learning via Dialogue

Sep 05, 2024

Weiwei Gu, Suresh Kondepudi, Lixiao Huang, Nakul Gopalan

Figure 1 for Continual Skill and Task Learning via Dialogue

Figure 2 for Continual Skill and Task Learning via Dialogue

Figure 3 for Continual Skill and Task Learning via Dialogue

Figure 4 for Continual Skill and Task Learning via Dialogue

Abstract:Continual and interactive robot learning is a challenging problem as the robot is present with human users who expect the robot to learn novel skills to solve novel tasks perpetually with sample efficiency. In this work we present a framework for robots to query and learn visuo-motor robot skills and task relevant information via natural language dialog interactions with human users. Previous approaches either focus on improving the performance of instruction following agents, or passively learn novel skills or concepts. Instead, we used dialog combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three different components for our agent. Firstly, we propose a novel visual-motor control policy ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Secondly, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding allowing us to know when to ask questions and/or demonstrations from users. Finally, we integrated an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with a 100% accuracy when trained with only five demonstrations for a novel skill while still maintaining a 74.75% accuracy on pre-trained skills in the RLBench dataset where other models fall significantly short. We also performed a human-subjects study with 8 subjects to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% in the task of sandwich making with the real robot learning from participant data demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.

Via

Access Paper or Ask Questions

Interactive Visual Task Learning for Robots

Dec 20, 2023

Weiwei Gu, Anant Sah, Nakul Gopalan

Figure 1 for Interactive Visual Task Learning for Robots

Figure 2 for Interactive Visual Task Learning for Robots

Figure 3 for Interactive Visual Task Learning for Robots

Figure 4 for Interactive Visual Task Learning for Robots

Abstract:We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users. Previous approaches have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies by enabling them to learn novel concepts and solve unseen robotics tasks with them. To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which augments information of a novel concept to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. We present two sets of results. Firstly, we compare Hi-Viscont with the baseline model (FALCON) on visual question answering(VQA) in three domains. While being comparable to the baseline model on leaf level concepts, Hi-Viscont achieves an improvement of over 9% on non-leaf concepts on average. We compare our model's performance against the baseline FALCON model. Our framework achieves 33% improvements in success rate metric, and 19% improvements in the object level accuracy compared to the baseline model. With both of these results we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot.

* In Proceedings of The 38th Annual AAAI Conference on Artificial Intelligence

Via

Access Paper or Ask Questions

Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Oct 03, 2023

Ifrah Idrees, Tian Yun, Naveen Sharma, Yunxin Deng, Nakul Gopalan, George Konidaris, Stefanie Tellex

Figure 1 for Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Figure 2 for Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Figure 3 for Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Figure 4 for Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

Abstract:Conversational assistive robots can aid people, especially those with cognitive impairments, to accomplish various tasks such as cooking meals, performing exercises, or operating machines. However, to interact with people effectively, robots must recognize human plans and goals from noisy observations of human actions, even when the user acts sub-optimally. Previous works on Plan and Goal Recognition (PGR) as planning have used hierarchical task networks (HTN) to model the actor/human. However, these techniques are insufficient as they do not have user engagement via natural modes of interaction such as language. Moreover, they have no mechanisms to let users, especially those with cognitive impairments, know of a deviation from their original plan or about any sub-optimal actions taken towards their goal. We propose a novel framework for plan and goal recognition in partially observable domains -- Dialogue for Goal Recognition (D4GR) enabling a robot to rectify its belief in human progress by asking clarification questions about noisy sensor data and sub-optimal human actions. We evaluate the performance of D4GR over two simulated domains -- kitchen and blocks domain. With language feedback and the world state information in a hierarchical task model, we show that D4GR framework for the highest sensor noise performs 1% better than HTN in goal accuracy in both domains. For plan accuracy, D4GR outperforms by 4% in the kitchen domain and 2% in the blocks domain in comparison to HTN. The ALWAYS-ASK oracle outperforms our policy by 3% in goal recognition and 7%in plan recognition. D4GR does so by asking 68% fewer questions than an oracle baseline. We also demonstrate a real-world robot scenario in the kitchen domain, validating the improved plan and goal recognition of D4GR in a realistic setting.

* Published in IROS 2023

Via

Access Paper or Ask Questions

Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Sep 01, 2023

Divyanshu Raj, Chitta Baral, Nakul Gopalan

Figure 1 for Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Figure 2 for Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Figure 3 for Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Figure 4 for Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Abstract:In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to map an instruction to a smaller fragment of the trajectory. Unlike previous instruction following works which directly learn the mapping from language to a policy, we propose a language-conditioned change-point detection method to identify sub-tasks in a problem. Our approach learns the relationship between constituent segments of a long language command and corresponding constituent segments of a trajectory. These constituent trajectory segments can be used to learn subtasks or sub-goals for planning or options as demonstrated by previous related work. Our insight in this work is that the language-conditioned robot change-point detection problem is similar to the existing video moment retrieval works used to identify sub-segments within online videos. Through extensive experimentation, we demonstrate a $1.78_{\pm 0.82}\%$ improvement over a baseline approach in accurately identifying sub-tasks within a trajectory using our proposed method. Moreover, we present a comprehensive study investigating sample complexity requirements on learning this mapping, between language and trajectory sub-segments, to understand if the video retrieval-based methods are realistic in real robot scenarios.

* 9 Pages, 13 figures, Accepted paper at the RSS 2023 Workshop on Articulate Robots: Utilizing Language for Robot Learning

Via

Access Paper or Ask Questions

Learning Efficient Exploration through Human Seeded Rapidly-exploring Random Trees

Mar 23, 2022

Max Zuo, Logan Schick, Matthew Gombolay, Nakul Gopalan

Figure 1 for Learning Efficient Exploration through Human Seeded Rapidly-exploring Random Trees

Figure 2 for Learning Efficient Exploration through Human Seeded Rapidly-exploring Random Trees

Figure 3 for Learning Efficient Exploration through Human Seeded Rapidly-exploring Random Trees

Figure 4 for Learning Efficient Exploration through Human Seeded Rapidly-exploring Random Trees

Abstract:Modern day computer games have extremely large state and action spaces. To detect bugs in these games' models, human testers play the games repeatedly to explore the game and find errors in the games. Such game play is exhaustive and time consuming. Moreover, since robotics simulators depend on similar methods of model specification and debugging, the problem of finding errors in the model is of interest for the robotics community to ensure robot behaviors and interactions are consistent in simulators. Previous methods have used reinforcement learning and search based methods including Rapidly-exploring Random Trees (RRT) to explore a game's state-action space to find bugs. However, such search and exploration based methods are not efficient at exploring the state-action space without a pre-defined heuristic. In this work we attempt to combine a human-tester's expertise in solving games, and the exhaustiveness of RRT to search a game's state space efficiently with high coverage. This paper introduces human-seeded RRT (HS-RRT) and behavior-cloning-assisted RRT (CA-RRT) in testing the number of game states searched and the time taken to explore those game states. We compare our methods to an existing weighted RRT baseline for game exploration testing studied. We find HS-RRT and CA-RRT both explore more game states in fewer tree expansions/iterations when compared to the existing baseline. In each test, CA-RRT reached more states on average in the same number of iterations as RRT. In our tested environments, CA-RRT was able to reach the same number of states as RRT by more than 5000 fewer iterations on average, almost a 50% reduction.

* HRI 2022 Workshop - MLHRC

Via

Access Paper or Ask Questions

Learning to Follow Language Instructions with Compositional Policies

Oct 09, 2021

Vanya Cohen, Geraud Nangue Tasse, Nakul Gopalan, Steven James, Matthew Gombolay, Benjamin Rosman

Figure 1 for Learning to Follow Language Instructions with Compositional Policies

Figure 2 for Learning to Follow Language Instructions with Compositional Policies

Figure 3 for Learning to Follow Language Instructions with Compositional Policies

Figure 4 for Learning to Follow Language Instructions with Compositional Policies

Abstract:We propose a framework that learns to execute natural language instructions in an environment consisting of goal-reaching tasks that share components of their task descriptions. Our approach leverages the compositionality of both value functions and language, with the aim of reducing the sample complexity of learning novel tasks. First, we train a reinforcement learning agent to learn value functions that can be subsequently composed through a Boolean algebra to solve novel tasks. Second, we fine-tune a seq2seq model pretrained on web-scale corpora to map language to logical expressions that specify the required value function compositions. Evaluating our agent in the BabyAI domain, we observe a decrease of 86% in the number of training steps needed to learn a second task after mastering a single task. Results from ablation studies further indicate that it is the combination of compositional value functions and language representations that allows the agent to quickly generalize to new tasks.

* Presented at AI-HRI symposium as part of AAAI-FSS 2021 (arXiv:2109.10836)

Via

Access Paper or Ask Questions

Interpretable Policy Specification and Synthesis through Natural Language and RL

Jan 18, 2021

Pradyumna Tambwekar, Andrew Silva, Nakul Gopalan, Matthew Gombolay

Figure 1 for Interpretable Policy Specification and Synthesis through Natural Language and RL

Figure 2 for Interpretable Policy Specification and Synthesis through Natural Language and RL

Figure 3 for Interpretable Policy Specification and Synthesis through Natural Language and RL

Figure 4 for Interpretable Policy Specification and Synthesis through Natural Language and RL

Abstract:Policy specification is a process by which a human can initialize a robot's behaviour and, in turn, warm-start policy optimization via Reinforcement Learning (RL). While policy specification/design is inherently a collaborative process, modern methods based on Learning from Demonstration or Deep RL lack the model interpretability and accessibility to be classified as such. Current state-of-the-art methods for policy specification rely on black-box models, which are an insufficient means of collaboration for non-expert users: These models provide no means of inspecting policies learnt by the agent and are not focused on creating a usable modality for teaching robot behaviour. In this paper, we propose a novel machine learning framework that enables humans to 1) specify, through natural language, interpretable policies in the form of easy-to-understand decision trees, 2) leverage these policies to warm-start reinforcement learning and 3) outperform baselines that lack our natural language initialization mechanism. We train our approach by collecting a first-of-its-kind corpus mapping free-form natural language policy descriptions to decision tree-based policies. We show that our novel framework translates natural language to decision trees with a 96% and 97% accuracy on a held-out corpus across two domains, respectively. Finally, we validate that policies initialized with natural language commands are able to significantly outperform relevant baselines (p < 0.001) that do not benefit from our natural language-based warm-start technique.

Via

Access Paper or Ask Questions