Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Antonio Loquercio

University of California Berkeley

AI Coaching for Accelerating Human Skill Development with Reinforcement Learning

Jun 24, 2026

Wei Wang, Enlin Gu, Antonio Loquercio, Haimin Hu, Rahul Mangharam

Abstract:AI copilots can substantially boost human performance through shared control, but excessive assistance can induce over-reliance and skill atrophy. This paper studies how an embodied AI agent can act as a coach that accelerates human motor-skill development. We argue that effective coaching requires strategic scaffolding and stepping back that are aligned with the learner's capability, allowing productive failures that drive learning. We formalize the interactive AI coaching process as a non-cooperative dynamic game in which the learner optimizes task performance while the coach targets the learner's independent competence. Building on this formalism, we develop a reinforcement learning framework combining adaptive shared control with probabilistic models of the coach's causal influence on skill evolution, enabling tractable training of coaching policies. A comprehensive user study (N=33) on first-person-view drone racing shows significant gains in human learning outcomes over state-of-the-art AI coaching baselines.

Via

Access Paper or Ask Questions

RAVEN: Long-Horizon Reasoning & Navigation with a Visuo-Spatio-Temporal Memory

Jun 23, 2026

Yixun Hu, Zhicheng Zheng, Lihan Zha, Chunwei Xing, Rajdeep Singh, Omar Hossain, Antonio Loquercio, Dhruv Shah

Abstract:Long-term robot deployment requires a compact and scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval. In this paper, we propose RAVEN, an agentic memory system for long-horizon robotic question answering and navigation. RAVEN stores visual embeddings with pose and time in a vector database, and grounds retrieval in a spatial map to answer queries and navigate to goals. By operating directly on visual embeddings, RAVEN avoids lossy image-to-text captioning and enables accurate semantic, spatial, and temporal retrieval at scale. Across several simulated and real-world video question-answering benchmarks, RAVEN consistently surpasses caption-based memory systems and matches frontier VLMs on long-horizon tasks at 10$\times$ lower retrieval cost. Finally, we instantiate RAVEN on a Unitree Go1 robot for the task of long-horizon navigation for natural language goal-reaching, and show successful deployment over several large indoor environments.

* Project website: https://ravenmem.github.io/

Via

Access Paper or Ask Questions

Reinforcement Learning for Flow-Matching Policies with Density Transport

Jun 07, 2026

Boshu Lei, Kostas Daniilidis, Antonio Loquercio

Abstract:We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name \emph{RLDT}, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it finetunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is \href{https://rpfey.github.io/rldt/}{https://rpfey.github.io/rldt/}.

Via

Access Paper or Ask Questions

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

Apr 08, 2026

Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio

Abstract:Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: https://roshi-mocap.github.io/

* 8 pages, 4 figures. *Equal contribution by first three authors. Project webpage: https://roshi-mocap.github.io/

Via

Access Paper or Ask Questions

Agile Flight Emerges from Multi-Agent Competitive Racing

Dec 12, 2025

Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio

Abstract:Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent

Via

Access Paper or Ask Questions

Deep Sensorimotor Control by Imitating Predictive Models of Human Motion

Aug 26, 2025

Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, Antonio Loquercio

Figure 1 for Deep Sensorimotor Control by Imitating Predictive Models of Human Motion

Figure 2 for Deep Sensorimotor Control by Imitating Predictive Models of Human Motion

Figure 3 for Deep Sensorimotor Control by Imitating Predictive Models of Human Motion

Figure 4 for Deep Sensorimotor Control by Imitating Predictive Models of Human Motion

Abstract:As the embodiment gap between a robot and a human narrows, new opportunities arise to leverage datasets of humans interacting with their surroundings for robot learning. We propose a novel technique for training sensorimotor policies with reinforcement learning by imitating predictive models of human motions. Our key insight is that the motion of keypoints on human-inspired robot end-effectors closely mirrors the motion of corresponding human body keypoints. This enables us to use a model trained to predict future motion on human data \emph{zero-shot} on robot data. We train sensorimotor policies to track the predictions of such a model, conditioned on a history of past robot states, while optimizing a relatively sparse task reward. This approach entirely bypasses gradient-based kinematic retargeting and adversarial losses, which limit existing methods from fully leveraging the scale and diversity of modern human-scene interaction datasets. Empirically, we find that our approach can work across robots and tasks, outperforming existing baselines by a large margin. In addition, we find that tracking a human motion model can substitute for carefully designed dense rewards and curricula in manipulation tasks. Code, data and qualitative results available at https://jirl-upenn.github.io/track_reward/.

* Blog Post: https://hgaurav2k.github.io/trackr/

Via

Access Paper or Ask Questions

Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

Jun 11, 2025

Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens

Abstract:We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: https://www.yimingdou.com/hearing_hands/

* CVPR 2025, Project page: https://www.yimingdou.com/hearing_hands/ , Code: https://github.com/Dou-Yiming/hearing_hands/

Via

Access Paper or Ask Questions

End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Nov 08, 2024

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio

Figure 1 for End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Figure 2 for End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Figure 3 for End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Figure 4 for End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering

Abstract:We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/

Via

Access Paper or Ask Questions

Hand-Object Interaction Pretraining from Videos

Sep 12, 2024

Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik

Figure 1 for Hand-Object Interaction Pretraining from Videos

Figure 2 for Hand-Object Interaction Pretraining from Videos

Figure 3 for Hand-Object Interaction Pretraining from Videos

Figure 4 for Hand-Object Interaction Pretraining from Videos

Abstract:We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: \url{https://hgaurav2k.github.io/hop/}.

Via

Access Paper or Ask Questions

Tactile-Augmented Radiance Fields

May 07, 2024

Yiming Dou, Fengyu Yang, Yi Liu, Antonio Loquercio, Andrew Owens

Figure 1 for Tactile-Augmented Radiance Fields

Figure 2 for Tactile-Augmented Radiance Fields

Figure 3 for Tactile-Augmented Radiance Fields

Figure 4 for Tactile-Augmented Radiance Fields

Abstract:We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and thus can be registered to images using methods from multi-view geometry, and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to register touch signals to a captured visual scene, and to train a conditional diffusion model that, provided with an RGB-D image rendered from a neural radiance field, generates its corresponding tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This dataset contains more touch samples than previous real-world datasets, and it provides spatially aligned visual signals for each captured touch signal. We demonstrate the accuracy of our cross-modal generative model and the utility of the captured visual-tactile data on several downstream tasks. Project page: https://dou-yiming.github.io/TaRF

* CVPR 2024, Project page: https://dou-yiming.github.io/TaRF, Code: https://github.com/Dou-Yiming/TaRF/

Via

Access Paper or Ask Questions