Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Cremers

PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

Nov 24, 2024

Haoang Li, Xiangqi Meng, Xingxing Zuo, Zhe Liu, Hesheng Wang, Daniel Cremers

Figure 1 for PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

Figure 2 for PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

Figure 3 for PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

Figure 4 for PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

Abstract:Simultaneous localization and mapping (SLAM) has achieved impressive performance in static environments. However, SLAM in dynamic environments remains an open question. Many methods directly filter out dynamic objects, resulting in incomplete scene reconstruction and limited accuracy of camera localization. The other works express dynamic objects by point clouds, sparse joints, or coarse meshes, which fails to provide a photo-realistic representation. To overcome the above limitations, we propose a photo-realistic and geometry-aware RGB-D SLAM method by extending Gaussian splatting. Our method is composed of three main modules to 1) map the dynamic foreground including non-rigid humans and rigid items, 2) reconstruct the static background, and 3) localize the camera. To map the foreground, we focus on modeling the deformations and/or motions. We consider the shape priors of humans and exploit geometric and appearance constraints of humans and items. For background mapping, we design an optimization strategy between neighboring local maps by integrating appearance constraint into geometric alignment. As to camera localization, we leverage both static background and dynamic foreground to increase the observations for noise compensation. We explore the geometric and appearance constraints by associating 3D Gaussians with 2D optical flows and pixel patches. Experiments on various real-world datasets demonstrate that our method outperforms state-of-the-art approaches in terms of camera localization and scene representation. Source codes will be publicly available upon paper acceptance.

Via

Access Paper or Ask Questions

ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset

Nov 07, 2024

Olaf Wysocki, Yue Tan, Thomas Froech, Yan Xia, Magdalena Wysocki, Ludwig Hoegner, Daniel Cremers, Christoph Holst

Figure 1 for ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset

Figure 2 for ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset

Figure 3 for ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset

Figure 4 for ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset

Abstract:Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and uniform methods' comparison. Realizing the LoFG, we present to date the largest semantic 3D facade segmentation dataset, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

* Accepted to WACV 2025 (IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))

Via

Access Paper or Ask Questions

Variational Low-Rank Adaptation Using IVON

Nov 07, 2024

Bai Cong, Nico Daheim, Yuesong Shen, Daniel Cremers, Rio Yokota, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Figure 1 for Variational Low-Rank Adaptation Using IVON

Figure 2 for Variational Low-Rank Adaptation Using IVON

Figure 3 for Variational Low-Rank Adaptation Using IVON

Figure 4 for Variational Low-Rank Adaptation Using IVON

Abstract:We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models. The code is available at https://github.com/team-approx-bayes/ivon-lora.

* Published at 38th Workshop on Fine-Tuning in Machine Learning (NeurIPS 2024). Code available at https://github.com/team-approx-bayes/ivon-lora

Via

Access Paper or Ask Questions

Interactions Across Blocks in Post-Training Quantization of Large Language Models

Nov 06, 2024

Khasmamad Shabanovi, Lukas Wiest, Vladimir Golkov, Daniel Cremers, Thomas Pfeil

Figure 1 for Interactions Across Blocks in Post-Training Quantization of Large Language Models

Figure 2 for Interactions Across Blocks in Post-Training Quantization of Large Language Models

Figure 3 for Interactions Across Blocks in Post-Training Quantization of Large Language Models

Figure 4 for Interactions Across Blocks in Post-Training Quantization of Large Language Models

Abstract:Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent and ignoring the knowledge of subsequent substructures as well as the task loss. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized blocks. The second incorporates knowledge of subsequent blocks by minimizing the error in downstream pre-activations rather than focusing solely on the quantized block. Our findings indicate that the effectiveness of these methods depends on the specific network model, with no impact on some models but demonstrating significant benefits for others.

Via

Access Paper or Ask Questions

Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms

Nov 05, 2024

Viktoria Ehm, Nafie El Amrani, Yizheng Xie, Lennart Bastian, Maolin Gao, Weikang Wang, Lu Sang, Dongliang Cao, Zorah Lähner, Daniel Cremers(+1 more)

Figure 1 for Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms

Figure 2 for Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms

Figure 3 for Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms

Figure 4 for Beyond Complete Shapes: A quantitative Evaluation of 3D Shape Matching Algorithms

Abstract:Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. While approaches based on machine learning dominate modern 3D shape matching, almost all existing (learning-based) methods require that at least one of the involved shapes is complete. In contrast, the most challenging and arguably most practically relevant setting of matching partially observed shapes, is currently underexplored. One important factor is that existing datasets contain only a small number of shapes (typically below 100), which are unable to serve data-hungry machine learning approaches, particularly in the unsupervised regime. In addition, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations and to encourage research on these relevant settings, we provide a generic and flexible framework for the procedural generation of challenging partial shape matching scenarios. Our framework allows for a virtually infinite generation of partial shape matching instances from a finite set of shapes with complete geometry. Further, we manually create cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, leading to a total of 2543 shapes. Based on this, we propose several challenging partial benchmark settings, for which we evaluate respective state-of-the-art methods as baselines.

Via

Access Paper or Ask Questions

MaskBit: Embedding-free Image Generation via Bit Tokens

Sep 24, 2024

Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

Figure 1 for MaskBit: Embedding-free Image Generation via Bit Tokens

Figure 2 for MaskBit: Embedding-free Image Generation via Bit Tokens

Figure 3 for MaskBit: Embedding-free Image Generation via Bit Tokens

Figure 4 for MaskBit: Embedding-free Image Generation via Bit Tokens

Abstract:Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.

* Project page: https://weber-mark.github.io/projects/maskbit.html

Via

Access Paper or Ask Questions

Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Sep 18, 2024

Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li

Figure 1 for Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Figure 2 for Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Figure 3 for Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Figure 4 for Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Abstract:Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since the non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce the physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrated that our PBA method outperforms existing approaches in accuracy.

* Accepted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Via

Access Paper or Ask Questions

MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Aug 29, 2024

Linyan Yang, Lukas Hoyer, Mark Weber, Tobias Fischer, Dengxin Dai, Laura Leal-Taixé, Marc Pollefeys, Daniel Cremers, Luc Van Gool

Figure 1 for MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Figure 2 for MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Figure 3 for MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Figure 4 for MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Abstract:Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.

Via

Access Paper or Ask Questions

Image Segmentation in Foundation Model Era: A Survey

Aug 23, 2024

Tianfei Zhou, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers

Figure 1 for Image Segmentation in Foundation Model Era: A Survey

Figure 2 for Image Segmentation in Foundation Model Era: A Survey

Figure 3 for Image Segmentation in Foundation Model Era: A Survey

Abstract:Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have embarked on a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also herald newfound segmentation capabilities previously unseen in deep learning context. However, current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research -- generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation), and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) -- by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs like CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Subsequently, we engage in a discussion of open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey catalyzes the evolution of advanced image segmentation systems.

* A comprehensive survey of image segmentation in foundation model era (work in progress)

Via

Access Paper or Ask Questions

LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Aug 21, 2024

Aymeric Fleith, Doaa Ahmed, Daniel Cremers, Niclas Zeller

Figure 1 for LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Figure 2 for LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Figure 3 for LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Figure 4 for LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Abstract:We propose LiFCal, a novel geometric online calibration pipeline for MLA-based light field cameras. LiFCal accurately determines model parameters from a moving camera sequence without precise calibration targets, integrating arbitrary metric scaling constraints. It optimizes intrinsic parameters of the light field camera model, the 3D coordinates of a sparse set of scene points and camera poses in a single bundle adjustment defined directly on micro image points. We show that LiFCal can reliably and repeatably calibrate a focused plenoptic camera using different input sequences, providing intrinsic camera parameters extremely close to state-of-the-art methods, while offering two main advantages: it can be applied in a target-free scene, and it is implemented online in a complete and continuous pipeline. Furthermore, we demonstrate the quality of the obtained camera parameters in downstream tasks like depth estimation and SLAM. Webpage: https://lifcal.github.io/

* Accepted to the German Conference on Pattern Recognition (GCPR) 2024

Via

Access Paper or Ask Questions