Marco Pavone

Stanford University and

Online Aggregation of Trajectory Predictors

Feb 11, 2025

Alex Tong, Apoorva Sharma, Sushant Veer, Marco Pavone, Heng Yang

Abstract:Trajectory prediction, the task of forecasting future agent behavior from past data, is central to safe and efficient autonomous driving. A diverse set of methods (e.g., rule-based or learned with different architectures and datasets) has been proposed, yet the performance of these methods is often sensitive to the deployment environment (e.g., how well the design rules model the environment, or how closely the test data match the training data). Building upon the principled theory of online convex optimization but also going beyond convexity and stationarity, we present a lightweight and model-agnostic method to aggregate different trajectory predictors online. We propose treating each individual trajectory predictor as an "expert" and maintaining a probability vector to mix the outputs of different experts. The key technical approach then lies in leveraging online data (the true agent behavior to be revealed at the next timestep) to form a convex-or-nonconvex, stationary-or-dynamic loss function whose gradient steers the probability vector towards choosing the best mixture of experts. We instantiate this method to aggregate trajectory predictors trained on different cities in the nuScenes dataset and show that it performs as well as, if not better than, any single model, even when deployed on the out-of-distribution Lyft dataset.

* 9 pages, 7 figures
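
The expert-mixing idea in the abstract above can be illustrated with a generic exponentiated-gradient update, a standard tool from online convex optimization. This is only a sketch under stated assumptions, not the paper's algorithm: the squared-error loss, the step size `eta`, the three toy experts, and all function names are hypothetical choices made for illustration.

```python
import math

def mix(weights, preds):
    """Weighted average of the experts' predicted positions."""
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) for d in range(dim)]

def eg_update(weights, preds, truth, eta=0.5):
    """One exponentiated-gradient step on the squared error of the mixture.

    Assumption: the loss is ||mix - truth||^2; the abstract does not
    specify the loss, so this is an illustrative stand-in.
    """
    residual = [m - t for m, t in zip(mix(weights, preds), truth)]
    # Gradient of the loss with respect to each expert's weight.
    grads = [2.0 * sum(r * pd for r, pd in zip(residual, p)) for p in preds]
    unnorm = [w * math.exp(-eta * g) for w, g in zip(weights, grads)]
    total = sum(unnorm)
    return [u / total for u in unnorm]  # stays on the probability simplex

def run_demo(steps=50):
    """Toy stream: the agent moves along the x-axis; experts differ in lateral bias."""
    weights = [1 / 3, 1 / 3, 1 / 3]
    for t in range(steps):
        truth = [float(t), 0.0]      # revealed after the prediction is made
        preds = [
            [float(t), 0.0],         # expert 0: unbiased
            [float(t), 1.0],         # expert 1: 1 m lateral bias
            [float(t), 2.0],         # expert 2: 2 m lateral bias
        ]
        weights = eg_update(weights, preds, truth)
    return weights
```

In this toy stream, the weight on the unbiased expert grows at every step while the biased experts decay, which is the qualitative behavior the abstract describes for the probability vector.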

Surprise Potential as a Measure of Interactivity in Driving Scenarios

Feb 08, 2025

Wenhao Ding, Sushant Veer, Karen Leung, Yulong Cao, Marco Pavone

Abstract:Validating the safety and performance of an autonomous vehicle (AV) requires benchmarking on real-world driving logs. However, typical driving logs contain mostly uneventful scenarios with minimal interactions between road users. Identifying interactive scenarios in real-world driving logs enables the curation of datasets that amplify critical signals and provide a more accurate assessment of an AV's performance. In this paper, we present a novel metric that identifies interactive scenarios by measuring an AV's surprise potential on others. First, we identify three dimensions of the design space to describe a family of surprise potential measures. Second, we exhaustively evaluate and compare different instantiations of the surprise potential measure within this design space on the nuScenes dataset. To determine how well a surprise potential measure correctly identifies an interactive scenario, we use a reward model learned from human preferences to assess alignment with human intuition. Our proposed surprise potential, arising from this exhaustive comparative study, achieves a correlation of more than 0.82 with the human-aligned reward function, outperforming existing approaches. Lastly, we validate motion planners on curated interactive scenarios to demonstrate downstream applications.

* 10 pages, 8 figures

DreamDrive: Generative 4D Scene Modeling from Street View Images

Jan 03, 2025

Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang

Abstract:Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

* Project page: https://pointscoder.github.io/DreamDrive/

STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

Dec 31, 2024

Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu (+3 more)

Abstract:We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degraded quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations (parameterized by 3D Gaussians and their velocities) in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.

* Project page at: https://jiawei-yang.github.io/STORM/

LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

Dec 10, 2024

Ziqi Lu, Heng Yang, Danfei Xu, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

Abstract:Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.

Extrapolated Urban View Synthesis Benchmark

Dec 10, 2024

Xiangyu Han, Zhen Jia, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng (+1 more)

Abstract:Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting methods across different difficulty levels. Our results show that Gaussian Splatting is prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released our data to help advance self-driving and urban robotics simulation technology.

* Project page: https://ai4ce.github.io/EUVS-Benchmark/

Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

Dec 05, 2024

Zhejun Zhang, Peter Karkus, Maximilian Igl, Wenhao Ding, Yuxiao Chen, Boris Ivanovic, Marco Pavone

Abstract:Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. The code is available at https://github.com/NVlabs/catk.

* Project Page: https://zhejz.github.io/catk/
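
The abstract above names the strategy "Closest Among Top-K" without spelling out the selection rule. One plausible reading, sketched here as an assumption rather than the authors' implementation, is that at each rollout step the policy's K most likely action tokens are shortlisted and the one nearest to the logged ground truth is executed. The scalar token values, the absolute-difference distance, and the function name are all hypothetical.

```python
def closest_among_top_k(probs, candidates, target, k=3):
    """Pick the candidate nearest to `target` among the k most probable tokens.

    probs:      policy probability for each token (hypothetical values)
    candidates: scalar motion value each token decodes to (illustrative)
    target:     logged ground-truth value at this step
    """
    # Shortlist the k most likely tokens under the policy.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Among the shortlist, stay as close as possible to the ground truth.
    return min(top, key=lambda i: abs(candidates[i] - target))

probs = [0.1, 0.4, 0.3, 0.2]
candidates = [1.0, 0.0, 0.5, 0.9]
choice_k2 = closest_among_top_k(probs, candidates, target=1.0, k=2)
choice_k4 = closest_among_top_k(probs, candidates, target=1.0, k=4)
```

With `k=2` the shortlist excludes the token that matches the target exactly, so the rollout follows a likely token that is merely close to the log. With `k` equal to the vocabulary size the rule reduces to always picking the nearest token.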

Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Nov 23, 2024

Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang

Abstract:Open-vocabulary 3D object detection, which aims to recognize novel classes in previously unseen domains, has recently attracted considerable attention due to its broad applications in autonomous driving and robotics. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.

* Accepted by NeurIPS 2024

Learning Multiple Initial Solutions to Optimization Problems

Nov 04, 2024

Elad Sharony, Heng Yang, Tong Che, Marco Pavone, Shie Mannor, Peter Karkus

Abstract:Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict multiple diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required. The code is available at https://github.com/EladSharony/miso.

* Under Review
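
The sensitivity to initialization and the multiple-optimizers strategy (ii) described in the abstract above can be seen on a toy problem. The sketch below is illustrative only: the nonconvex objective, plain gradient descent as the local optimizer, and the hard-coded initial guesses (standing in for the output of a learned predictor) are all assumptions, not taken from the paper or its code.

```python
def f(x):
    """Toy nonconvex objective with a local and a global minimum."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def local_descent(x0, lr=0.01, steps=500):
    """Plain gradient descent; converges to whichever basin x0 lies in."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def best_of_inits(inits):
    """Run one local optimizer per initial guess and keep the best result."""
    finals = [local_descent(x0) for x0 in inits]
    x_best = min(finals, key=f)
    return x_best, f(x_best)

# Hypothetical initial solutions; in the abstract these come from a learned model.
x_best, f_best = best_of_inits([2.0, 0.5, -2.0])
x_single = local_descent(2.0)  # a single poor initialization, for comparison
```

Starting only from `2.0`, descent settles in the shallower basin on the positive side, whereas the best of the three runs reaches the deeper minimum on the negative side. Each run is independent, so the runs could be executed in parallel as the abstract notes.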

Transformer-based Model Predictive Control: Trajectory Optimization via Sequence Modeling

Oct 31, 2024

Davide Celestini, Daniele Gammelli, Tommaso Guffanti, Simone D'Amico, Elisa Capello, Marco Pavone

Abstract:Model predictive control (MPC) has established itself as the primary methodology for constrained control, enabling general-purpose robot autonomy in diverse real-world scenarios. However, for most problems of interest, MPC relies on the recursive solution of highly non-convex trajectory optimization problems, leading to high computational complexity and strong dependency on initialization. In this work, we present a unified framework to combine the main strengths of optimization-based and learning-based methods for MPC. Our approach entails embedding high-capacity, transformer-based neural network models within the optimization process for trajectory generation, whereby the transformer provides a near-optimal initial guess, or target plan, to a non-convex optimization problem. Our experiments, performed in simulation and the real world onboard a free flyer platform, demonstrate the capabilities of our framework to improve MPC convergence and runtime. Compared to purely optimization-based approaches, results show that our approach can improve trajectory generation performance by up to 75%, reduce the number of solver iterations by up to 45%, and improve overall MPC runtime by 7x without loss in performance.

* IEEE Robotics and Automation Letters, vol. 9, n. 11, pp. 9820-9827, Nov. 2024
* 8 pages, 7 figures. Datasets, videos and code available at: https://transformermpc.github.io
