Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marco Pavone

Sanford University and

Vision Foundation Model Embedding-Based Semantic Anomaly Detection

May 12, 2025

Max Peter Ronecker, Matthew Foutter, Amine Elhafsi, Daniele Gammelli, Ihor Barakaiev, Marco Pavone, Daniel Watzenig

Abstract:Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.

* Accepted for the Workshop "Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities" at ICRA 2025

Via

Access Paper or Ask Questions

Flight Validation of Learning-Based Trajectory Optimization for the Astrobee Free-Flyer

May 08, 2025

Somrita Banerjee, Abhishek Cauligi, Marco Pavone

Abstract:Although widely used in commercial and industrial robotics, trajectory optimization has seen limited use in space applications due to its high computational demands. In this work, we present flight results from experiments with the Astrobee free-flying robot on board the International Space Station (ISS), that demonstrate how machine learning can accelerate on-board trajectory optimization while preserving theoretical solver guarantees. To the best of the authors' knowledge, this is the first-ever demonstration of learning-based control on the ISS. Our approach leverages the GuSTO sequential convex programming framework and uses a neural network, trained offline, to map problem parameters to effective initial ``warm-start'' trajectories, paving the way for faster real-time optimization on resource-constrained space platforms.

* Submitted to RSS 2025 Workshop on Space Robotics

Via

Access Paper or Ask Questions

Deformable Cargo Transport in Microgravity with Astrobee

May 02, 2025

Daniel Morton, Rika Antonova, Brian Coltin, Marco Pavone, Jeannette Bohg

Abstract:We present pyastrobee: a simulation environment and control stack for Astrobee in Python, with an emphasis on cargo manipulation and transport tasks. We also demonstrate preliminary success from a sampling-based MPC controller, using reduced-order models of NASA's cargo transfer bag (CTB) to control a high-order deformable finite element model. Our code is open-source, fully documented, and available at https://danielpmorton.github.io/pyastrobee

Via

Access Paper or Ask Questions

Describe Anything: Detailed Localized Image and Video Captioning

Apr 22, 2025

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala(+1 more)

Abstract:Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

* Project page: https://describe-anything.github.io/

Via

Access Paper or Ask Questions

Robo-taxi Fleet Coordination at Scale via Reinforcement Learning

Apr 09, 2025

Luigi Tresca, Carolin Schmidt, James Harrison, Filipe Rodrigues, Gioele Zardini, Daniele Gammelli, Marco Pavone

Abstract:Fleets of robo-taxis offering on-demand transportation services, commonly known as Autonomous Mobility-on-Demand (AMoD) systems, hold significant promise for societal benefits, such as reducing pollution, energy consumption, and urban congestion. However, orchestrating these systems at scale remains a critical challenge, with existing coordination algorithms often failing to exploit the systems' full potential. This work introduces a novel decision-making framework that unites mathematical modeling with data-driven techniques. In particular, we present the AMoD coordination problem through the lens of reinforcement learning and propose a graph network-based framework that exploits the main strengths of graph representation learning, reinforcement learning, and classical operations research tools. Extensive evaluations across diverse simulation fidelities and scenarios demonstrate the flexibility of our approach, achieving superior system performance, computational efficiency, and generalizability compared to prior methods. Finally, motivated by the need to democratize research efforts in this area, we release publicly available benchmarks, datasets, and simulators for network-level coordination alongside an open-source codebase designed to provide accessible simulation platforms and establish a standardized validation process for comparing methodologies. Code available at: https://github.com/StanfordASL/RL4AMOD

* 12 pages, 6 figures, 6 tables

Via

Access Paper or Ask Questions

Data Scaling Laws for End-to-End Autonomous Driving

Apr 06, 2025

Alexander Naumann, Xunjiang Gu, Tolga Dimlioglu, Mariusz Bojarski, Alperen Degirmenci, Alexander Popov, Devansh Bisla, Marco Pavone, Urs Müller, Boris Ivanovic

Figure 1 for Data Scaling Laws for End-to-End Autonomous Driving

Figure 2 for Data Scaling Laws for End-to-End Autonomous Driving

Figure 3 for Data Scaling Laws for End-to-End Autonomous Driving

Figure 4 for Data Scaling Laws for End-to-End Autonomous Driving

Abstract:Autonomous vehicle (AV) stacks have traditionally relied on decomposed approaches, with separate modules handling perception, prediction, and planning. However, this design introduces information loss during inter-module communication, increases computational overhead, and can lead to compounding errors. To address these challenges, recent works have proposed architectures that integrate all components into an end-to-end differentiable model, enabling holistic system optimization. This shift emphasizes data engineering over software integration, offering the potential to enhance system performance by simply scaling up training resources. In this work, we evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours with both open-loop metrics and closed-loop simulations. Specifically, we investigate how much additional training data is needed to achieve a target performance gain, e.g., a 5% improvement in motion prediction accuracy. By understanding the relationship between model performance and training dataset size, we aim to provide insights for data-driven decision-making in autonomous driving development.

* 15 pages, 11 figures, 4 tables, CVPR 2025 Workshop on Autonomous Driving

Via

Access Paper or Ask Questions

Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds

Apr 04, 2025

Hugo Buurmeijer, Luis A. Pabon, John Irvin Alora, Roshan S. Kaundinya, George Haller, Marco Pavone

Figure 1 for Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds

Figure 2 for Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds

Figure 3 for Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds

Figure 4 for Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds

Abstract:High-dimensional nonlinear systems pose considerable challenges for modeling and control across many domains, from fluid mechanics to advanced robotics. Such systems are typically approximated with reduced order models, which often rely on orthogonal projections, a simplification that may lead to large prediction errors. In this work, we derive optimality of fiber-aligned projections onto spectral submanifolds, preserving the nonlinear geometric structure and minimizing long-term prediction error. We propose a computationally tractable procedure to approximate these projections from data, and show how the effect of control can be incorporated. For a 180-dimensional robotic system, we demonstrate that our reduced-order models outperform previous state-of-the-art approaches by up to fivefold in trajectory tracking accuracy under model predictive control.

Via

Access Paper or Ask Questions

Scaling Vision Pre-Training to 4K Resolution

Mar 25, 2025

Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov(+1 more)

Abstract:High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to state of the arts, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.

* CVPR 2025. Project Page: https://nvlabs.github.io/PS3

Via

Access Paper or Ask Questions

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

Mar 13, 2025

Roshan S. Kaundinya, John Irvin Alora, Jonas G. Matt, Luis A. Pabon, Marco Pavone, George Haller

Abstract:The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of this data-driven model on various dynamic trajectory tracking tasks on a high-fidelity and high-dimensional finite-element model of a soft trunk robot. Notably, we find that four- or five-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to 10 across all closed-loop control tasks.

* 33 pages, 19 figures

Via

Access Paper or Ask Questions

Safe, Task-Consistent Manipulation with Operational Space Control Barrier Functions

Mar 09, 2025

Daniel Morton, Marco Pavone

Figure 1 for Safe, Task-Consistent Manipulation with Operational Space Control Barrier Functions

Figure 2 for Safe, Task-Consistent Manipulation with Operational Space Control Barrier Functions

Figure 3 for Safe, Task-Consistent Manipulation with Operational Space Control Barrier Functions

Figure 4 for Safe, Task-Consistent Manipulation with Operational Space Control Barrier Functions

Abstract:Safe real-time control of robotic manipulators in unstructured environments requires handling numerous safety constraints without compromising task performance. Traditional approaches, such as artificial potential fields (APFs), suffer from local minima, oscillations, and limited scalability, while model predictive control (MPC) can be computationally expensive. Control barrier functions (CBFs) offer a promising alternative due to their high level of robustness and low computational cost, but these safety filters must be carefully designed to avoid significant reductions in the overall performance of the manipulator. In this work, we introduce an Operational Space Control Barrier Function (OSCBF) framework that integrates safety constraints while preserving task-consistent behavior. Our approach scales to hundreds of simultaneous constraints while retaining real-time control rates, ensuring collision avoidance, singularity prevention, and workspace containment even in highly cluttered and dynamic settings. By explicitly accounting for the task hierarchy in the CBF objective, we prevent degraded performance across both joint-space and operational-space tasks, when at the limit of safety. Our open-source, high-performance software will be available at our project webpage, https://stanfordasl.github.io/oscbf/

Via

Access Paper or Ask Questions