Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marco Pavone

Sanford University and

Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Oct 24, 2024

Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu(+3 more)

Figure 1 for Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Figure 2 for Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Figure 3 for Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Figure 4 for Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Abstract:Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.

* Project Website: https://largespatialmodel.github.io

Via

Access Paper or Ask Questions

Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Oct 16, 2024

Minkyoung Cho, Yulong Cao, Jiachen Sun, Qingzhao Zhang, Marco Pavone, Jeong Joon Park, Heng Yang, Z. Morley Mao

Figure 1 for Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Figure 2 for Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Figure 3 for Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Figure 4 for Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Abstract:An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.

* 23 pages

Via

Access Paper or Ask Questions

LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Oct 15, 2024

Christopher Diehl, Peter Karkus, Sushant Veer, Marco Pavone, Torsten Bertram

Figure 1 for LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Figure 2 for LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Figure 3 for LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Figure 4 for LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Abstract:Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach improves forgetting by up to 23.33% and the closed-loop OOD driving score by 8.83% in comparison to standard fine-tuning.

* Under Review

Via

Access Paper or Ask Questions

Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers

Oct 15, 2024

Davide Celestini, Amirhossein Afsharrad, Daniele Gammelli, Tommaso Guffanti, Gioele Zardini, Sanjay Lall, Elisa Capello, Simone D'Amico, Marco Pavone

Figure 1 for Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers

Figure 2 for Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers

Figure 3 for Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers

Figure 4 for Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers

Abstract:Effective trajectory generation is essential for reliable on-board spacecraft autonomy. Among other approaches, learning-based warm-starting represents an appealing paradigm for solving the trajectory generation problem, effectively combining the benefits of optimization- and data-driven methods. Current approaches for learning-based trajectory generation often focus on fixed, single-scenario environments, where key scene characteristics, such as obstacle positions or final-time requirements, remain constant across problem instances. However, practical trajectory generation requires the scenario to be frequently reconfigured, making the single-scenario approach a potentially impractical solution. To address this challenge, we present a novel trajectory generation framework that generalizes across diverse problem configurations, by leveraging high-capacity transformer neural networks capable of learning from multimodal data sources. Specifically, our approach integrates transformer-based neural network models into the trajectory optimization process, encoding both scene-level information (e.g., obstacle locations, initial and goal states) and trajectory-level constraints (e.g., time bounds, fuel consumption targets) via multimodal representations. The transformer network then generates near-optimal initial guesses for non-convex optimization problems, significantly enhancing convergence speed and performance. The framework is validated through extensive simulations and real-world experiments on a free-flyer platform, achieving up to 30% cost improvement and 80% reduction in infeasible cases with respect to traditional approaches, and demonstrating robust generalization across diverse scenario variations.

* 8 pages, 6 figures, submitted to 2025 American Control Conference (ACC)

Via

Access Paper or Ask Questions

Offline Hierarchical Reinforcement Learning via Inverse Optimization

Oct 10, 2024

Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues

Figure 1 for Offline Hierarchical Reinforcement Learning via Inverse Optimization

Figure 2 for Offline Hierarchical Reinforcement Learning via Inverse Optimization

Figure 3 for Offline Hierarchical Reinforcement Learning via Inverse Optimization

Figure 4 for Offline Hierarchical Reinforcement Learning via Inverse Optimization

Abstract:Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.

Via

Access Paper or Ask Questions

Towards Robust Spacecraft Trajectory Optimization via Transformers

Oct 08, 2024

Yuji Takubo, Tommaso Guffanti, Daniele Gammelli, Marco Pavone, Simone D'Amico

Abstract:Future multi-spacecraft missions require robust autonomous trajectory optimization capabilities to ensure safe and efficient rendezvous operations. This capability hinges on solving non-convex optimal control problems in real time, although traditional iterative methods such as sequential convex programming impose significant computational challenges. To mitigate this burden, the Autonomous Rendezvous Transformer introduced a generative model trained to provide near-optimal initial guesses. This approach provides convergence to better local optima (e.g., fuel optimality), improves feasibility rates, and results in faster convergence speed of optimization algorithms through warm-starting. This work extends the capabilities of ART to address robust chance-constrained optimal control problems. Specifically, ART is applied to challenging rendezvous scenarios in Low Earth Orbit (LEO), ensuring fault-tolerant behavior under uncertainty. Through extensive experimentation, the proposed warm-starting strategy is shown to consistently produce high-quality reference trajectories, achieving up to 30% cost improvement and 50% reduction in infeasible cases compared to conventional methods, demonstrating robust performance across multiple state representations. Additionally, a post hoc evaluation framework is proposed to assess the quality of generated trajectories and mitigate runtime failures, marking an initial step toward the reliable deployment of AI-driven solutions in safety-critical autonomous systems such as spacecraft.

* Submitted to the IEEE Aerospace Conference 2025. 13 pages, 10 figures

Via

Access Paper or Ask Questions

Gen-Drive: Enhancing Diffusion Generative Driving Policies with Reward Modeling and Reinforcement Learning Fine-tuning

Oct 08, 2024

Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, Chen Lv

Abstract:Autonomous driving necessitates the ability to reason about future interactions between traffic agents and to make informed evaluations for planning. This paper introduces the \textit{Gen-Drive} framework, which shifts from the traditional prediction and deterministic planning framework to a generation-then-evaluation planning paradigm. The framework employs a behavior diffusion model as a scene generator to produce diverse possible future scenarios, thereby enhancing the capability for joint interaction reasoning. To facilitate decision-making, we propose a scene evaluator (reward) model, trained with pairwise preference data collected through VLM assistance, thereby reducing human workload and enhancing scalability. Furthermore, we utilize an RL fine-tuning framework to improve the generation quality of the diffusion model, rendering it more effective for planning tasks. We conduct training and closed-loop planning tests on the nuPlan dataset, and the results demonstrate that employing such a generation-then-evaluation strategy outperforms other learning-based approaches. Additionally, the fine-tuned generative driving policy shows significant enhancements in planning performance. We further demonstrate that utilizing our learned reward model for evaluation or RL fine-tuning leads to better planning performance compared to relying on human-designed rewards. Project website: https://mczhi.github.io/GenDrive.

Via

Access Paper or Ask Questions

Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Oct 06, 2024

Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, Jeannette Bohg

Figure 1 for Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Figure 2 for Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Figure 3 for Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Figure 4 for Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Abstract:Robot behavior policies trained via imitation learning are prone to failure under conditions that deviate from their training data. Thus, algorithms that monitor learned policies at test time and provide early warnings of failure are necessary to facilitate scalable deployment. We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories: 1) Erratic failures, which we detect using statistical measures of temporal action consistency, and 2) task progression failures, where we use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. Our approach has two key strengths. First, because learned policies exhibit diverse failure modes, combining complementary detectors leads to significantly higher accuracy at failure detection. Second, using a statistical temporal action consistency measure ensures that we quickly detect when multimodal, generative policies exhibit erratic behavior at negligible computational cost. In contrast, we only use VLMs to detect failure modes that are less time-sensitive. We demonstrate our approach in the context of diffusion policies trained on robotic mobile manipulation domains in both simulation and the real world. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than using either of the two detectors alone and significantly outperforms baselines, thus highlighting the importance of assigning specialized detectors to complementary categories of failure. Qualitative results are made available at https://sites.google.com/stanford.edu/sentinel.

* Project page: https://sites.google.com/stanford.edu/sentinel . 35 pages, 9 figures. Accepted to the Conference on Robot Learning (CoRL) 2024

Via

Access Paper or Ask Questions

LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Oct 03, 2024

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li(+2 more)

Figure 1 for LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Figure 2 for LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Figure 3 for LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Figure 4 for LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Abstract:This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. Pairwise Preference Reward Model~(PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.

Via

Access Paper or Ask Questions

System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles

Sep 26, 2024

Kaustav Chakraborty, Zeyuan Feng, Sushant Veer, Apoorva Sharma, Boris Ivanovic, Marco Pavone, Somil Bansal

Figure 1 for System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles

Figure 2 for System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles

Figure 3 for System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles

Figure 4 for System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles

Abstract:The safety-critical nature of autonomous vehicle (AV) operation necessitates development of task-relevant algorithms that can reason about safety at the system level and not just at the component level. To reason about the impact of a perception failure on the entire system performance, such task-relevant algorithms must contend with various challenges: complexity of AV stacks, high uncertainty in the operating environments, and the need for real-time performance. To overcome these challenges, in this work, we introduce a Q-network called SPARQ (abbreviation for Safety evaluation for Perception And Recovery Q-network) that evaluates the safety of a plan generated by a planning algorithm, accounting for perception failures that the planning process may have overlooked. This Q-network can be queried during system runtime to assess whether a proposed plan is safe for execution or poses potential safety risks. If a violation is detected, the network can then recommend a corrective plan while accounting for the perceptual failure. We validate our algorithm using the NuPlan-Vegas dataset, demonstrating its ability to handle cases where a perception failure compromises a proposed plan while the corrective plan remains safe. We observe an overall accuracy and recall of 90% while sustaining a frequency of 42Hz on the unseen testing dataset. We compare our performance to a popular reachability-based baseline and analyze some interesting properties of our approach in improving the safety properties of an AV pipeline.

Via

Access Paper or Ask Questions