We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, large language models can generate code or provide common sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study recent papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper (Preliminary release. We are committed to further enhancing and updating this work to ensure its quality and relevance) can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models
Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is the limited data scale. High-quality data with simultaneously captured human and 3D environments is rare, resulting in limited data diversity and complexity. In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective, leading us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). Thus, the need for costly paired motion-scene datasets with high-quality scene scans can be substantially alleviated. With this new unified view of Human-Occupancy interaction, a single motion controller is proposed to reach the target state given the surrounding occupancy. Once trained on MOB with complex occupancy layout, the controller could handle cramped scenes and generalize well to general scenes with limited complexity. With no GT 3D scenes for training, our method can generate realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. Our code and data would be made publicly available at https://foruck.github.io/occu-page/.
Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation , and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.
Human reasoning can be understood as a cooperation between the intuitive, associative "System-1" and the deliberative, logical "System-2". For existing System-1-like methods in visual activity understanding, it is crucial to integrate System-2 processing to improve explainability, generalization, and data efficiency. One possible path of activity reasoning is building a symbolic system composed of symbols and rules, where one rule connects multiple symbols, implying human knowledge and reasoning abilities. Previous methods have made progress, but are defective with limited symbols from handcraft and limited rules from visual-based annotations, failing to cover the complex patterns of activities and lacking compositional generalization. To overcome the defects, we propose a new symbolic system with two ideal important properties: broad-coverage symbols and rational rules. Collecting massive human knowledge via manual annotations is expensive to instantiate this symbolic system. Instead, we leverage the recent advancement of LLMs (Large Language Models) as an approximation of the two ideal properties, i.e., Symbols from Large Language Models (Symbol-LLM). Then, given an image, visual contents from the images are extracted and checked as symbols and activity semantics are reasoned out based on rules via fuzzy logic calculation. Our method shows superiority in extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.
Transparent objects are widely used in our daily lives, making it important to teach robots to interact with them. However, it's not easy because the reflective and refractive effects can make RGB-D cameras fail to give accurate geometry measurements. To solve this problem, this paper introduces RFTrans, an RGB-D-based method for surface normal estimation and manipulation of transparent objects. By leveraging refractive flow as an intermediate representation, RFTrans circumvents the drawbacks of directly predicting the geometry (e.g. surface normal) from RGB images and helps bridge the sim-to-real gap. RFTrans integrates the RFNet, which predicts refractive flow, object mask, and boundaries, followed by the F2Net, which estimates surface normal from the refractive flow. To make manipulation possible, a global optimization module will take in the predictions, refine the raw depth, and construct the point cloud with normal. An analytical grasp planning algorithm, ISF, is followed to generate the grasp poses. We build a synthetic dataset with physically plausible ray-tracing rendering techniques to train the networks. Results show that the RFTrans trained on the synthetic dataset can consistently outperform the baseline ClearGrasp in both synthetic and real-world benchmarks by a large margin. Finally, a real-world robot grasping task witnesses an 83% success rate, proving that refractive flow can help enable direct sim-to-real transfer. The code, data, and supplementary materials are available at https://rftrans.robotflow.ai.
This paper presents a novel simulation platform, ZeMa, designed for robotic manipulation tasks concerning soft objects. Such simulation ideally requires three properties: two-way soft-rigid coupling, intersection-free guarantees, and frictional contact modeling, with acceptable runtime suitable for deep and reinforcement learning tasks. Current simulators often satisfy only a subset of these needs, primarily focusing on distinct rigid-rigid or soft-soft interactions. The proposed ZeMa prioritizes physical accuracy and integrates the incremental potential contact method, offering unified dynamics simulation for both soft and rigid objects. It efficiently manages soft-rigid contact, operating 75x faster than baseline tools with similar methodologies like IPC-GraspSim. To demonstrate its applicability, we employ it for parallel grasp generation, penetrated grasp repair, and reinforcement learning for grasping, successfully transferring the trained RL policy to real-world scenarios.
Tactile perception stands as a critical sensory modality for human interaction with the environment. Among various tactile sensor techniques, optical sensor-based approaches have gained traction, notably for producing high-resolution tactile images. This work explores gel elastomer deformation simulation through a physics-based approach. While previous works in this direction usually adopt the explicit material point method (MPM), which has certain limitations in force simulation and rendering, we adopt the finite element method (FEM) and address the challenges in penetration and mesh distortion with incremental potential contact (IPC) method. As a result, we present a simulator named TacIPC, which can ensure numerically stable simulations while accommodating direct rendering and friction modeling. To evaluate TacIPC, we conduct three tasks: pseudo-image quality assessment, deformed geometry estimation, and marker displacement prediction. These tasks show its superior efficacy in reducing the sim-to-real gap. Our method can also seamlessly integrate with existing simulators. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/tac-ipc.
In the realm of robotic cloth manipulation, accurately estimating the cloth state during or post-execution is imperative. However, the inherent complexities in a cloth's dynamic behavior and its near-infinite degrees of freedom (DoF) pose significant challenges. Traditional methods have been restricted to using keypoints or boundaries as cues for cloth state, which do not holistically capture the cloth's structure, especially during intricate tasks like folding. Additionally, the critical influence of cloth physics has often been overlooked in past research. Addressing these concerns, we introduce DiffCP, a novel differentiable pipeline that leverages the Anisotropic Elasto-Plastic (A-EP) constitutive model, tailored for differentiable computation and robotic tasks. DiffCP adopts a ``real-to-sim-to-real'' methodology. By observing real-world cloth states through an RGB-D camera and projecting this data into a differentiable simulator, the system identifies physics parameters by minimizing the geometric variance between observed and target states. Extensive experiments demonstrate DiffCP's ability and stability to determine physics parameters under varying manipulations, grasping points, and speeds. Additionally, its applications extend to cloth material identification, manipulation trajectory generation, and more notably, enhancing cloth pose estimation accuracy. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/diffcp.
Fluid interactions permeate daily human activities, with properties like density and viscosity playing pivotal roles in household tasks. While density estimation is straightforward through Archimedes' principle, viscosity poses a more intricate challenge, especially given the varied behaviors of Newtonian and non-Newtonian fluids. These fluids, which differ in their stress-strain relationships, are delineated by specific constitutive models such as the Carreau, Cross, and Herschel-Bulkley models, each possessing unique viscosity parameters. This study introduces a novel differentiable fitting framework, DiffStir, tailored to identify key physics parameters via the common daily operation of stirring. By employing a robotic arm for stirring and harnessing a differentiable Material Point Method (diffMPM)-based simulator, the framework can determine fluid parameters by matching observations from both the simulator and the real world. Recognizing the distinct preferences of the aforementioned constitutive models for specific fluids, an online strategy was adopted to adaptively select the most fitting model based on real-world data. Additionally, we propose a refining neural network to bridge the sim-to-real gap and mitigate sensor noise-induced inaccuracies. Comprehensive experiments were conducted to validate the efficacy of DiffStir, showcasing its precision in parameter estimation when benchmarked against reported literature values. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/diffstir.
This work presents a novel tactile perception-based method, named T-NT, for performing the needle-threading task, an application of deformable linear object (DLO) manipulation. This task is divided into two main stages: Tail-end Finding and Tail-end Insertion. In the first stage, the agent traces the contour of the thread twice using vision-based tactile sensors mounted on the gripper fingers. The two-run tracing is to locate the tail-end of the thread. In the second stage, it employs a tactile-guided reinforcement learning (RL) model to drive the robot to insert the thread into the target needle eyelet. The RL model is trained in a Unity-based simulated environment. The simulation environment supports tactile rendering which can produce realistic tactile images and thread modeling. During insertion, the position of the poke point and the center of the eyelet are obtained through a pre-trained segmentation model, Grounded-SAM, which predicts the masks for both the needle eye and thread imprints. These positions are then fed into the reinforcement learning model, aiding in a smoother transition to real-world applications. Extensive experiments on real robots are conducted to demonstrate the efficacy of our method. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/tac-needlethreading.