Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jonathan Francis

Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting

Nov 05, 2024

Michael Büttner, Jonathan Francis, Helge Rhodin, Andrew Melnik

Abstract:This paper introduces a method to enhance Interactive Imitation Learning (IIL) by extracting touch interaction points and tracking object movement from video demonstrations. The approach extends current IIL systems by providing robots with detailed knowledge of both where and how to interact with objects, particularly complex articulated ones like doors and drawers. By leveraging cutting-edge techniques such as 3D Gaussian Splatting and FoundationPose for tracking, this method allows robots to better understand and manipulate objects in dynamic environments. The research lays the foundation for more effective task learning and execution in autonomous robotic systems.

* CoRL 2024, Workshop on Lifelong Learning for Home Robots, Munich, Germany

Via

Access Paper or Ask Questions

Grounding Robot Policies with Visuomotor Language Guidance

Oct 10, 2024

Arthur Bucker, Pablo Ortega-Kral, Jonathan Francis, Jean Oh

Figure 1 for Grounding Robot Policies with Visuomotor Language Guidance

Figure 2 for Grounding Robot Policies with Visuomotor Language Guidance

Figure 3 for Grounding Robot Policies with Visuomotor Language Guidance

Figure 4 for Grounding Robot Policies with Visuomotor Language Guidance

Abstract:Recent advances in the fields of natural language processing and computer vision have shown great potential in understanding the underlying dynamics of the world from large-scale internet data. However, translating this knowledge into robotic systems remains an open challenge, given the scarcity of human-robot interactions and the lack of large-scale datasets of real-world robotic data. Previous robot learning approaches such as behavior cloning and reinforcement learning have shown great capabilities in learning robotic skills from human demonstrations or from scratch in specific environments. However, these approaches often require task-specific demonstrations or designing complex simulation environments, which limits the development of generalizable and robust policies for new settings. Aiming to address these limitations, we propose an agent-based framework for grounding robot policies to the current context, considering the constraints of a current robot and its environment using visuomotor-grounded language guidance. The proposed framework is composed of a set of conversational agents designed for specific roles -- namely, high-level advisor, visual grounding, monitoring, and robotic agents. Given a base policy, the agents collectively generate guidance at run time to shift the action distribution of the base policy towards more desirable future states. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates both in simulation and in real-world experiments without the need for additional human demonstrations or extensive exploration. Project videos at https://sites.google.com/view/motorcortex/home.

* 19 pages, 6 figures, 1 table

Via

Access Paper or Ask Questions

SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

Sep 16, 2024

Benjamin Stoler, Ingrid Navarro, Jonathan Francis, Jean Oh

Figure 1 for SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

Figure 2 for SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

Figure 3 for SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

Figure 4 for SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario Generation

Abstract:Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned scoring functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: https://github.com/cmubig/SEAL

* 8 pages, 4 figures, 2 tables

Via

Access Paper or Ask Questions

Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Aug 17, 2024

Siyu Wu, Alessandro Oltramari, Jonathan Francis, C. Lee Giles, Frank E. Ritter

Figure 1 for Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Figure 2 for Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Figure 3 for Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Figure 4 for Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making

Abstract:Resolving the dichotomy between the human-like yet constrained reasoning processes of Cognitive Architectures and the broad but often noisy inference behavior of Large Language Models (LLMs) remains a challenging but exciting pursuit, for enabling reliable machine reasoning capabilities in production systems. Because Cognitive Architectures are famously developed for the purpose of modeling the internal mechanisms of human cognitive decision-making at a computational level, new investigations consider the goal of informing LLMs with the knowledge necessary for replicating such processes, e.g., guided perception, memory, goal-setting, and action. Previous approaches that use LLMs for grounded decision-making struggle with complex reasoning tasks that require slower, deliberate cognition over fast and intuitive inference -- reporting issues related to the lack of sufficient grounding, as in hallucination. To resolve these challenges, we introduce LLM-ACTR, a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making by integrating the ACT-R Cognitive Architecture with LLMs. Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations, injects this information into trainable LLM adapter layers, and fine-tunes the LLMs for downstream prediction. Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability of our approach, compared to LLM-only baselines that leverage chain-of-thought reasoning strategies.

* 20 pages, 8 figures, 2 tables

Via

Access Paper or Ask Questions

RoPotter: Toward Robotic Pottery and Deformable Object Manipulation with Structural Priors

Aug 05, 2024

Uksang Yoo, Adam Hung, Jonathan Francis, Jean Oh, Jeffrey Ichnowski

Abstract:Humans are capable of continuously manipulating a wide variety of deformable objects into complex shapes. This is made possible by our intuitive understanding of material properties and mechanics of the object, for reasoning about object states even when visual perception is occluded. These capabilities allow us to perform diverse tasks ranging from cooking with dough to expressing ourselves with pottery-making. However, developing robotic systems to robustly perform similar tasks remains challenging, as current methods struggle to effectively model volumetric deformable objects and reason about the complex behavior they typically exhibit. To study the robotic systems and algorithms capable of deforming volumetric objects, we introduce a novel robotics task of continuously deforming clay on a pottery wheel. We propose a pipeline for perception and pottery skill-learning, called RoPotter, wherein we demonstrate that structural priors specific to the task of pottery-making can be exploited to simplify the pottery skill-learning process. Namely, we can project the cross-section of the clay to a plane to represent the state of the clay, reducing dimensionality. We also demonstrate a mesh-based method of occluded clay state recovery, toward robotic agents capable of continuously deforming clay. Our experiments show that by using the reduced representation with structural priors based on the deformation behaviors of the clay, RoPotter can perform the long-horizon pottery task with 44.4% lower final shape error compared to the state-of-the-art baselines.

Via

Access Paper or Ask Questions

Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Jul 09, 2024

Sriram Yenamandra, Arun Ramachandran, Mukul Khanna, Karmesh Yadav, Jay Vakil, Andrew Melnik, Michael Büttner, Leon Harz, Lyon Brown, Gora Chand Nandi(+35 more)

Figure 1 for Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Figure 2 for Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Figure 3 for Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Figure 4 for Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Abstract:In order to develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface within that environment. We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task. Our baselines on the most challenging version of this task, using real perception in simulation, achieved only an 0.8% success rate; by the end of the competition, the best participants achieved an 10.8\% success rate, a 13x improvement. We observed that the most successful teams employed a variety of methods, yet two common threads emerged among the best solutions: enhancing error detection and recovery, and improving the integration of perception with decision-making processes. In this paper, we detail the results and methodologies used, both in simulation and real-world settings. We discuss the lessons learned and their implications for future research. Additionally, we compare performance in real and simulated environments, emphasizing the necessity for robust generalization to novel settings.

Via

Access Paper or Ask Questions

Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Jul 08, 2024

Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari

Figure 1 for Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Figure 2 for Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Figure 3 for Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Figure 4 for Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Abstract:Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. The task of being able to classify a traffic scene as a specific type of accident is the focus of this work. We approach the problem by likening a traffic scene to a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of an accident can be referred to as a scene graph, and is used as input for an accident classifier. Better results can be obtained with a classifier that fuses the scene graph input with representations from vision and language. This work introduces a multi-stage, multimodal pipeline to pre-process videos of traffic accidents, encode them as scene graphs, and align this representation with vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.

* Accepted: 1st Workshop on Semantic Reasoning and Goal Understanding in Robotics, at the Robotics Science and Systems Conference (SemRob @ RSS 2024)

Via

Access Paper or Ask Questions

Semantic Augmentation in Images using Language

Apr 02, 2024

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, Eric Nyberg

Figure 1 for Semantic Augmentation in Images using Language

Figure 2 for Semantic Augmentation in Images using Language

Figure 3 for Semantic Augmentation in Images using Language

Figure 4 for Semantic Augmentation in Images using Language

Abstract:Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to train these diffusion models, we propose a technique to utilize generated images to augment existing datasets. This paper explores various strategies for effective data augmentation to improve the out-of-domain generalization capabilities of deep learning models.

Via

Access Paper or Ask Questions

Attribution Regularization for Multimodal Paradigms

Apr 02, 2024

Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, Eric Nyberg

Figure 1 for Attribution Regularization for Multimodal Paradigms

Abstract:Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dominates the decision-making process, resulting in suboptimal performance. This research project aims to address these challenges by proposing a novel regularization term that encourages multimodal models to effectively utilize information from all modalities when making decisions. The focus of this project lies in the video-audio domain, although the proposed regularization technique holds promise for broader applications in embodied AI research, where multiple modalities are involved. By leveraging this regularization term, the proposed approach aims to mitigate the issue of unimodal dominance and improve the performance of multimodal machine learning systems. Through extensive experimentation and evaluation, the effectiveness and generalizability of the proposed technique will be assessed. The findings of this research project have the potential to significantly contribute to the advancement of multimodal machine learning and facilitate its application in various domains, including multimedia analysis, human-computer interaction, and embodied AI research.

Via

Access Paper or Ask Questions

CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

Mar 19, 2024

Peide Huang, Wenhao Ding, Jonathan Francis, Bingqing Chen, Ding Zhao

Figure 1 for CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

Figure 2 for CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

Figure 3 for CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

Figure 4 for CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories

Abstract:Simulation is an indispensable tool in the development and testing of autonomous vehicles (AVs), offering an efficient and safe alternative to road testing by allowing the exploration of a wide range of scenarios. Despite its advantages, a significant challenge within simulation-based testing is the generation of safety-critical scenarios, which are essential to ensure that AVs can handle rare but potentially fatal situations. This paper addresses this challenge by introducing a novel generative framework, CaDRE, which is specifically designed for generating diverse and controllable safety-critical scenarios using real-world trajectories. Our approach optimizes for both the quality and diversity of scenarios by employing a unique formulation and algorithm that integrates real-world data, domain knowledge, and black-box optimization techniques. We validate the effectiveness of our framework through extensive testing in three representative types of traffic scenarios. The results demonstrate superior performance in generating diverse and high-quality scenarios with greater sample efficiency than existing reinforcement learning and sampling-based methods.

Via

Access Paper or Ask Questions