Abstract:To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
Abstract:While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
Abstract:In human-robot collaboration (HRC), robots must adapt online to dynamic task constraints and evolving human intent. While physical corrections provide a natural, low-latency channel for operators to convey motion-level adjustments, extracting task-level semantic intent from such brief interactions remains challenging. Existing foundation-model-based approaches primarily rely on vision and language inputs and lack mechanisms to interpret physical feedback. Meanwhile, traditional physical human-robot interaction (pHRI) methods leverage physical corrections for trajectory guidance but struggle to infer task-level semantics. To bridge this gap, we propose TATIC, a unified framework that utilizes torque-based contact force estimation and a task-aware Temporal Convolutional Network (TCN) to jointly infer discrete task-level intent and estimate continuous motion-level parameters from brief physical corrections. Task-aligned feature canonicalization ensures robust generalization across diverse layouts, while an intent-driven adaptation scheme translates inferred human intent into robot motion adaptations. Experiments achieve a 0.904 Macro-F1 score in intent recognition and demonstrate successful hardware validation in collaborative disassembly (see experimental video at https://youtu.be/xF8A52qwEc8).
Abstract:Disassembly automation has long been pursued to address the growing demand for efficient and proper recovery of valuable components from the end-of-life (EoL) electronic products. Existing approaches have demonstrated promising and regimented performance by decomposing the disassembly process into different subtasks. However, each subtask typically requires extensive data preparation, model training, and system management. Moreover, these approaches are often task- and component-specific, making them poorly suited to handle the variability and uncertainty of EoL products and limiting their generalization capabilities. All these factors restrict the practical deployment of current robotic disassembly systems and leave them highly reliant on human labor. With the recent development of foundation models in robotics, vision-language-action (VLA) models have shown impressive performance on standard robotic manipulation tasks, but their applicability to complex, contact-rich, and long-horizon industrial practices like disassembly, which requires sequential and precise manipulation, remains limited. To address this challenge, we propose SELF-VLA, an agentic VLA framework that integrates explicit disassembly skills. Experimental studies demonstrate that our framework significantly outperforms current state-of-the-art end-to-end VLA models on two contact-rich disassembly tasks. The video illustration can be found via https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/IROS-VLA-Video.mp4.
Abstract:This paper introduces a new algorithm for trajectory optimization, Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization (DRAFTO). It first constructs a constrained objective that accounts for smoothness, safety, joint limits, and task requirements. Then, it optimizes the coefficients, which are the coordinates of a set of basis functions for trajectory parameterization. To reduce the number of repeated constrained optimizations while handling joint-limit feasibility, the optimization is decoupled into a reduced-space Gauss-Newton (GN) descent for the main iterations and constrained quadratic programming for initialization and terminal feasibility repair. The two-phase acceptance rule with a non-monotone policy is applied to the GN model, which uses a hinge-squared penalty for inequality constraints, to ensure globalizability. The results of our benchmark tests against optimization-based planners, such as CHOMP, TrajOpt, GPMP2, and FACTO, and sampling-based planners, such as RRT-Connect, RRT*, and PRM, validate the high efficiency and reliability across diverse scenarios and tasks. The experiment involving grabbing an object from a drawer further demonstrates the potential for implementation in complex manipulation tasks. The supplemental video is available at https://youtu.be/XisFI37YyTQ.
Abstract:Robotic disassembly involves contact-rich interactions in which successful manipulation depends not only on geometric alignment but also on force-dependent state transitions. While vision-based policies perform well in structured settings, their reliability often degrades in tight-tolerance, contact-dominated, or deformable scenarios. In this work, we systematically investigate the role of tactile sensing in robotic disassembly through both simulation and real-world experiments. We construct five rigid-body disassembly tasks in simulation with increasing geometric constraints and extraction difficulty. We further design five real-world tasks, including three rigid and two deformable scenarios, to evaluate contact-dependent manipulation. Within a unified learning framework, we compare three sensing configurations: Vision Only, Vision + tactile RGB (TacRGB), and Vision + tactile force field (TacFF). Across both simulation and real-world experiments, TacFF-based policies consistently achieve the highest success rates, with particularly notable gains in contact-dependent and deformable settings. Notably, naive fusion of TacRGB and TacFF underperforms either modality alone, indicating that simple concatenation can dilute task-relevant force information. Our results show that tactile sensing plays a critical, task-dependent role in robotic disassembly, with structured force-field representations being particularly effective in contact-dominated scenarios.
Abstract:Sampling-based motion planning algorithms are widely used for motion planning of robotic manipulators, but they often struggle with sample inefficiency in high-dimensional configuration spaces due to their reliance on uniform or hand-crafted informed sampling primitives. Neural informed samplers address this limitation by learning the sampling distribution from prior planning experience to guide the motion planner towards planning goal. However, existing approaches often struggle to encode the spatial structure inherent in motion planning problems. To address this limitation, we introduce Graph-based Attention Masking for Spatial- and Embodiment-aware Motion Planning (GAIDE), a neural informed sampler that leverages both the spatial structure of the planning problem and the robotic manipulator's embodiment to guide the planning algorithm. GAIDE represents these structures as a graph and integrates it into a transformer-based neural sampler through attention masking. We evaluate GAIDE against baseline state-of-the-art sampling-based planners using uniform sampling, hand-crafted informed sampling, and neural informed sampling primitives. Evaluation results demonstrate that GAIDE improves planning efficiency and success rate.
Abstract:This paper introduces Function-space Adaptive Constrained Trajectory Optimization (FACTO), a new trajectory optimization algorithm for both single- and multi-arm manipulators. Trajectory representations are parameterized as linear combinations of orthogonal basis functions, and optimization is performed directly in the coefficient space. The constrained problem formulation consists of both an objective functional and a finite-dimensional objective defined over truncated coefficients. To address nonlinearity, FACTO uses a Gauss-Newton approximation with exponential moving averaging, yielding a smoothed quadratic subproblem. Trajectory-wide constraints are addressed using coefficient-space mappings, and an adaptive constrained update using the Levenberg-Marquardt algorithm is performed in the null space of active constraints. Comparisons with optimization-based planners (CHOMP, TrajOpt, GPMP2) and sampling-based planners (RRT-Connect, RRT*, PRM) show the improved solution quality and feasibility, especially in constrained single- and multi-arm scenarios. The experimental evaluation of FACTO on Franka robots verifies the feasibility of deployment.
Abstract:The integration of foundation models (FMs) into robotics has accelerated real-world deployment, while introducing new safety challenges arising from open-ended semantic reasoning and embodied physical action. These challenges require safety notions beyond physical constraint satisfaction. In this paper, we characterize FM-enabled robot safety along three dimensions: action safety (physical feasibility and constraint compliance), decision safety (semantic and contextual appropriateness), and human-centered safety (conformance to human intent, norms, and expectations). We argue that existing approaches, including static verification, monolithic controllers, and end-to-end learned policies, are insufficient in settings where tasks, environments, and human expectations are open-ended, long-tailed, and subject to adaptation over time. To address this gap, we propose modular safety guardrails, consisting of monitoring (evaluation) and intervention layers, as an architectural foundation for comprehensive safety across the autonomy stack. Beyond modularity, we highlight possible cross-layer co-design opportunities through representation alignment and conservatism allocation to enable faster, less conservative, and more effective safety enforcement. We call on the community to explore richer guardrail modules and principled co-design strategies to advance safe real-world physical AI deployment.
Abstract:Stochastic human motion prediction is critical for safe and effective human-robot collaboration (HRC) in industrial remanufacturing, as it captures human motion uncertainties and multi-modal behaviors that deterministic methods cannot handle. While earlier works emphasize highly diverse predictions, they often generate unrealistic human motions. More recent methods focus on accuracy and real-time performance, yet there remains potential to improve prediction quality further without exceeding time budgets. Additionally, current research on stochastic human motion prediction in HRC typically considers human motion in isolation, neglecting the influence of robot motion on human behavior. To address these research gaps and enable real-time, realistic, and interaction-aware human motion prediction, we propose a novel prediction-refinement framework that integrates both human and robot observed motion to refine the initial predictions produced by a pretrained state-of-the-art predictor. The refinement module employs a Flow Matching structure to account for uncertainty. Experimental studies on the HRC desktop disassembly dataset demonstrate that our method significantly improves prediction accuracy while preserving the uncertainties and multi-modalities of human motion. Moreover, the total inference time of the proposed framework remains within the time budget, highlighting the effectiveness and practicality of our approach.