Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Cremers

OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion

Apr 30, 2025

Shuhao Kang, Martin Y. Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers

Abstract:LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components. First, a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning. Second, an adaptive radial fusion module that dynamically consolidates radial features into discriminative global descriptors. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher recall at @1m threshold for top-1 retrieved matches, along with 12x faster inference speed compared to the state-of-the-art approach. Code and datasets will be publicly available.

* Technical report. 15 pages, 9 figures

Via

Access Paper or Ask Questions

Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Apr 24, 2025

Oussema Dhaouadi, Johannes Meier, Luca Wahl, Jacques Kaiser, Luca Scalerandi, Nick Wandelburg, Zhuolun Zhou, Nijanthan Berinpanathan, Holger Banzhaf, Daniel Cremers

Figure 1 for Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Figure 2 for Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Figure 3 for Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Figure 4 for Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

Abstract:Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. DSC3D dataset was captured in five various locations in Europe and the United States and include: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at app.deepscenario.com, facilitating research in motion prediction, behavior modeling, and safety validation.

Via

Access Paper or Ask Questions

PRaDA: Projective Radial Distortion Averaging

Apr 23, 2025

Daniil Sinitsyn, Linus Härenstam-Nielsen, Daniel Cremers

Figure 1 for PRaDA: Projective Radial Distortion Averaging

Figure 2 for PRaDA: Projective Radial Distortion Averaging

Figure 3 for PRaDA: Projective Radial Distortion Averaging

Figure 4 for PRaDA: Projective Radial Distortion Averaging

Abstract:We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in Projective Space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3d points and full bundle adjustment. By relying on pairwise projective relations, our methods support any feature-matching approaches without constructing point tracks across multiple images.

* Accepted at CVPR 2025. 8 pages + references

Via

Access Paper or Ask Questions

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Apr 20, 2025

Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers

Abstract:Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

* Project page: https://wrchen530.github.io/projects/batrack/

Via

Access Paper or Ask Questions

TwoSquared: 4D Generation from 2D Image Pairs

Apr 17, 2025

Lu Sang, Zehranaz Canfes, Dongliang Cao, Riccardo Marin, Florian Bernard, Daniel Cremers

Figure 1 for TwoSquared: 4D Generation from 2D Image Pairs

Figure 2 for TwoSquared: 4D Generation from 2D Image Pairs

Figure 3 for TwoSquared: 4D Generation from 2D Image Pairs

Figure 4 for TwoSquared: 4D Generation from 2D Image Pairs

Abstract:Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D module generation based on the existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. To this end, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences only given 2D images.

Via

Access Paper or Ask Questions

RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning

Apr 16, 2025

Yuan Luo, Rudolf Hoffmann, Yan Xia, Olaf Wysocki, Benedikt Schwab, Thomas H. Kolbe, Daniel Cremers

Abstract:Semantic 3D city models are worldwide easy-accessible, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via a SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean avarage precision (mAP) and 3.51% in mean avarage recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available athttps://gpp-communication.github.io/RADLER .

* The paper accepted for CVPRW '25 (PBVS 2025 - the Perception Beyond the Visible Spectrum)

Via

Access Paper or Ask Questions

Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Apr 15, 2025

Oussema Dhaouadi, Johannes Meier, Jacques Kaiser, Daniel Cremers

Figure 1 for Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Figure 2 for Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Figure 3 for Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Figure 4 for Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

Abstract:Road surface reconstruction from aerial images is fundamental for autonomous driving, urban planning, and virtual simulation, where smoothness, compactness, and accuracy are critical quality factors. Existing reconstruction methods often produce artifacts and inconsistencies that limit usability, while downstream tasks have a tendency to represent roads as planes for simplicity but at the cost of accuracy. We introduce FlexRoad, the first framework to directly address road surface smoothing by fitting Non-Uniform Rational B-Splines (NURBS) surfaces to 3D road points obtained from photogrammetric reconstructions or geodata providers. Our method at its core utilizes the Elevation-Constrained Spatial Road Clustering (ECSRC) algorithm for robust anomaly correction, significantly reducing surface roughness and fitting errors. To facilitate quantitative comparison between road surface reconstruction methods, we present GeoRoad Dataset (GeRoD), a diverse collection of road surface and terrain profiles derived from openly accessible geodata. Experiments on GeRoD and the photogrammetry-based DeepScenario Open 3D Dataset (DSC3D) demonstrate that FlexRoad considerably surpasses commonly used road surface representations across various metrics while being insensitive to various input sources, terrains, and noise types. By performing ablation studies, we identify the key role of each component towards high-quality reconstruction performance, making FlexRoad a generic method for realistic road surface modeling.

Via

Access Paper or Ask Questions

PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation

Apr 06, 2025

Lei Cheng, Mahdi Saleh, Qing Cheng, Lu Sang, Hongli Xu, Daniel Cremers, Federico Tombari

Abstract:Despite the advancements in 3D full-shape generation, accurately modeling complex geometries and semantics of shape parts remains a significant challenge, particularly for shapes with varying numbers of parts. Current methods struggle to effectively integrate the contextual and structural information of 3D shapes into their generative processes. We address these limitations with PRISM, a novel compositional approach for 3D shape generation that integrates categorical diffusion models with Statistical Shape Models (SSM) and Gaussian Mixture Models (GMM). Our method employs compositional SSMs to capture part-level geometric variations and uses GMM to represent part semantics in a continuous space. This integration enables both high fidelity and diversity in generated shapes while preserving structural coherence. Through extensive experiments on shape generation and manipulation tasks, we demonstrate that our approach significantly outperforms previous methods in both quality and controllability of part-level operations. Our code will be made publicly available.

* Project page: https://starry-lei.github.io/prism_3d_shape

Via

Access Paper or Ask Questions

Scene-Centric Unsupervised Panoptic Segmentation

Apr 02, 2025

Oliver Hahn, Christoph Reich, Nikita Araslanov, Daniel Cremers, Christian Rupprecht, Stefan Roth

Abstract:Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.

* To appear at CVPR 2025. Christoph Reich and Oliver Hahn - both authors contributed equally. Code: https://github.com/visinf/cups Project page: https://visinf.github.io/cups/

Via

Access Paper or Ask Questions

It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

Mar 31, 2025

Dominik Schnaus, Nikita Araslanov, Daniel Cremers

Abstract:The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e. without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.

* Accepted to CVPR 2025, Project page: https://dominik-schnaus.github.io/itsamatch/

Via

Access Paper or Ask Questions