Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Luca Carlone

Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation

Jul 01, 2025

Xihang Yu, Rajat Talak, Jingnan Shi, Ulrich Viereck, Igor Gilitschenski, Luca Carlone

Abstract:Modern warehouse automation systems rely on fleets of intelligent robots that generate vast amounts of data -- most of which remains unannotated. This paper develops a self-supervised domain adaptation pipeline that leverages real-world, unlabeled data to improve perception models without requiring manual annotations. Our work focuses specifically on estimating the pose and shape of boxes and presents a correct-and-certify pipeline for self-supervised box pose and shape estimation. We extensively evaluate our approach across a range of simulated and real industrial settings, including adaptation to a large-scale real-world dataset of 50,000 images. The self-supervised model significantly outperforms models trained solely in simulation and shows substantial improvements over a zero-shot 3D bounding box estimation baseline.

* 12 pages, 6 figures. This work will be presented at the 19th International Symposium on Experimental Robotics (ISER2025)

Via

Access Paper or Ask Questions

Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs

Jun 09, 2025

Jared Strader, Aaron Ray, Jacob Arkin, Mason B. Peterson, Yun Chang, Nathan Hughes, Christopher Bradley, Yi Xuan Jia, Carlos Nieto-Granda, Rajat Talak(+4 more)

Abstract:In this paper, we introduce a multi-robot system that integrates mapping, localization, and task and motion planning (TAMP) enabled by 3D scene graphs to execute complex instructions expressed in natural language. Our system builds a shared 3D scene graph incorporating an open-set object-based map, which is leveraged for multi-robot 3D scene graph fusion. This representation supports real-time, view-invariant relocalization (via the object-based map) and planning (via the 3D scene graph), allowing a team of robots to reason about their surroundings and execute complex tasks. Additionally, we introduce a planning approach that translates operator intent into Planning Domain Definition Language (PDDL) goals using a Large Language Model (LLM) by leveraging context from the shared 3D scene graph and robot capabilities. We provide an experimental assessment of the performance of our system on real-world tasks in large-scale, outdoor environments.

* 12 pages, 4 figures

Via

Access Paper or Ask Questions

Gaussian Mapping for Evolving Scenes

Jun 07, 2025

Vladimir Yugay, Thies Kersten, Luca Carlone, Theo Gevers, Martin R. Oswald, Lukas Schmid

Abstract:Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, with augmented reality, robotics, and autonomous driving applications. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have started addressing short-term dynamics (motion within the view of the camera), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We evaluate Gaussian Mapping for Evolving Scenes (GaME) on both synthetic and real-world datasets and find it to be more accurate than the state of the art.

Via

Access Paper or Ask Questions

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

May 18, 2025

Dominic Maggio, Hyungtae Lim, Luca Carlone

Abstract:We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

Via

Access Paper or Ask Questions

An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments

Apr 18, 2025

Ali Agha, Kyohei Otsu, Benjamin Morrell, David D. Fan, Sung-Kyun Kim, Muhammad Fadhil Ginting, Xianmei Lei, Jeffrey Edlund, Seyed Fakoorian, Amanda Bouman(+79 more)

Abstract:This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.

* IEEE Transactions on Field Robotics, vol. 1, pp. 476-526, 2024

Via

Access Paper or Ask Questions

ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Apr 10, 2025

Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang

Abstract:While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.

Via

Access Paper or Ask Questions

BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Mar 11, 2025

Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park

Abstract:Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.

* 20 pages, 14 figures

Via

Access Paper or Ask Questions

Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting

Mar 07, 2025

Dominic Maggio, Luca Carlone

Abstract:Open-set semantic mapping requires (i) determining the correct granularity to represent the scene (e.g., how should objects be defined), and (ii) fusing semantic knowledge across multiple 2D observations into an overall 3D reconstruction -ideally with a high-fidelity yet low-memory footprint. While most related works bypass the first issue by grouping together primitives with similar semantics (according to some manually tuned threshold), we recognize that the object granularity is task-dependent, and develop a task-driven semantic mapping approach. To address the second issue, current practice is to average visual embedding vectors over multiple views. Instead, we show the benefits of using a probabilistic approach based on the properties of the underlying visual-language foundation model, and leveraging Bayesian updating to aggregate multiple observations of the scene. The result is Bayesian Fields, a task-driven and probabilistic approach for open-set semantic mapping. To enable high-fidelity objects and a dense scene representation, Bayesian Fields uses 3D Gaussians which we cluster into task-relevant objects, allowing for both easy 3D object extraction and reduced memory usage. We release Bayesian Fields open-source at https: //github.com/MIT-SPARK/Bayesian-Fields.

Via

Access Paper or Ask Questions

Outlier-Robust Training of Machine Learning Models

Dec 31, 2024

Rajat Talak, Charis Georgiou, Jingnan Shi, Luca Carlone

Abstract:Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel $\sigma$ that is satisfied by robust losses in both the literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels {\sigma} increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: https://github.com/MIT-SPARK/ORT.

Via

Access Paper or Ask Questions

Integrating Vision Systems and STPA for Robust Landing and Take-Off in VTOL Aircraft

Dec 12, 2024

Sandeep Banik, Jinrae Kim, Naira Hovakimyan, Luca Carlone, John P. Thomas, Nancy G. Leveson

Abstract:Vertical take-off and landing (VTOL) unmanned aerial vehicles (UAVs) are versatile platforms widely used in applications such as surveillance, search and rescue, and urban air mobility. Despite their potential, the critical phases of take-off and landing in uncertain and dynamic environments pose significant safety challenges due to environmental uncertainties, sensor noise, and system-level interactions. This paper presents an integrated approach combining vision-based sensor fusion with System-Theoretic Process Analysis (STPA) to enhance the safety and robustness of VTOL UAV operations during take-off and landing. By incorporating fiducial markers, such as AprilTags, into the control architecture, and performing comprehensive hazard analysis, we identify unsafe control actions and propose mitigation strategies. Key contributions include developing the control structure with vision system capable of identifying a fiducial marker, multirotor controller and corresponding unsafe control actions and mitigation strategies. The proposed solution is expected to improve the reliability and safety of VTOL UAV operations, paving the way for resilient autonomous systems.

* 12 pages, 5 figures and 5 tables. Submitted to SciTech 2025

Via

Access Paper or Ask Questions