Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Federico Tombari

P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Aug 29, 2024

Mathias Vogel, Keisuke Tateno, Marc Pollefeys, Federico Tombari, Marie-Julie Rakotosaona, Francis Engelmann

Figure 1 for P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Figure 2 for P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Figure 3 for P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Figure 4 for P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Abstract:In this work, we tackle the task of point cloud denoising through a novel framework that adapts Diffusion Schr\"odinger bridges to points clouds. Unlike previous approaches that predict point-wise displacements from point features or learned noise distributions, our method learns an optimal transport plan between paired point clouds. Experiments on object datasets like PU-Net and real-world datasets such as ScanNet++ and ARKitScenes show that P2P-Bridge achieves significant improvements over existing methods. While our approach demonstrates strong results using only point coordinates, we also show that incorporating additional features, such as color information or point-wise DINOv2 features, further enhances the performance. Code and pretrained models are available at https://p2p-bridge.github.io.

* ECCV 2024 Project page: https://p2p-bridge.github.io

Via

Access Paper or Ask Questions

Extracting Training Data from Document-Based VQA Models

Jul 11, 2024

Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari

Figure 1 for Extracting Training Data from Document-Based VQA Models

Figure 2 for Extracting Training Data from Document-Based VQA Models

Figure 3 for Extracting Training Data from Document-Based VQA Models

Figure 4 for Extracting Training Data from Document-Based VQA Models

Abstract:Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personal Identifiable Information (PII) repeated once in the training set, indicating these models could divulge memorised sensitive information and therefore pose a privacy risk. We quantitatively measure the extractability of information in controlled experiments and differentiate between cases where it arises from generalization capabilities or from memorization. We further investigate the factors that influence memorization across multiple state-of-the-art models and propose an effective heuristic countermeasure that empirically prevents the extractability of PII.

* ICML 2024

Via

Access Paper or Ask Questions

Toward a Diffusion-Based Generalist for Dense Vision Tasks

Jun 29, 2024

Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

Figure 1 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 2 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 3 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Figure 4 for Toward a Diffusion-Based Generalist for Dense Vision Tasks

Abstract:Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

* Published at CVPR 2024 as a workshop paper

Via

Access Paper or Ask Questions

Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Jun 26, 2024

Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas

Figure 1 for Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Figure 2 for Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Figure 3 for Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Figure 4 for Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Abstract:Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional edibility. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian "marbles", reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on-par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

Via

Access Paper or Ask Questions

Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Jun 20, 2024

Matthew Zheng, Enis Simsar, Hidir Yesiltepe, Federico Tombari, Joel Simon, Pinar Yanardag

Figure 1 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 2 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 3 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 4 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Abstract:Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we introduce \texttt{STYLEBREEDER}, a comprehensive dataset of 6.8M images and 1.8M prompts generated by 95K users on Artbreeder, a platform that has emerged as a significant hub for creative exploration with over 13M users. We introduce a series of tasks with this dataset aimed at identifying diverse artistic styles, generating personalized content, and recommending styles based on user interests. By documenting unique, user-generated styles that transcend conventional categories like 'cyberpunk' or 'Picasso,' we explore the potential for unique, crowd-sourced styles that could provide deep insights into the collective creative psyche of users worldwide. We also evaluate different personalization methods to enhance artistic expression and introduce a style atlas, making these models available in LoRA format for public use. Our research demonstrates the potential of text-to-image diffusion models to uncover and promote unique artistic expressions, further democratizing AI in art and fostering a more diverse and inclusive artistic community. The dataset, code and models are available at https://stylebreeder.github.io under a Public Domain (CC0) license.

Via

Access Paper or Ask Questions

RaNeuS: Ray-adaptive Neural Surface Reconstruction

Jun 14, 2024

Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract:Our objective is to leverage a differentiable radiance field \eg NeRF to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct the small-scale details, such as the leaves, ropes, and textile surfaces. Considering that different methods formulate and optimize the projection from SDF to radiance field with a globally constant Eikonal regularization, we improve with a ray-wise weighting factor to prioritize the rendering and zero-crossing surface fitting on top of establishing a perfect SDF. We propose to adaptively adjust the regularization on the signed distance field so that unsatisfying rendering rays won't enforce strong Eikonal regularization which is ineffective, and allow the gradients from regions with well-learned radiance to effectively back-propagated to the SDF. Consequently, balancing the two objectives in order to generate accurate and detailed surfaces. Additionally, concerning whether there is a geometric bias between the zero-crossing surface in SDF and rendering points in the radiance field, the projection becomes adjustable as well depending on different 3D locations during optimization. Our proposed \textit{RaNeuS} are extensively evaluated on both synthetic and real datasets, achieving state-of-the-art results on both novel view synthesis and geometric reconstruction.

* 3DV 2024, oral. In: Proceedings of the IEEE/CVF International Conference on 3D Vision (2023)

Via

Access Paper or Ask Questions

Mixed Diffusion for 3D Indoor Scene Synthesis

May 31, 2024

Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, Federico Tombari

Figure 1 for Mixed Diffusion for 3D Indoor Scene Synthesis

Figure 2 for Mixed Diffusion for 3D Indoor Scene Synthesis

Figure 3 for Mixed Diffusion for 3D Indoor Scene Synthesis

Figure 4 for Mixed Diffusion for 3D Indoor Scene Synthesis

Abstract:Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully explored in floor-conditioned scene synthesis problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture, designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach uniquely implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task specific training. We show MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments.

* 19 pages, 14 figures. Under review. Code to be released at: https://github.com/MIT-SPARK/MiDiffusion

Via

Access Paper or Ask Questions

Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

May 26, 2024

Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R. Oswald, Federico Tombari

Figure 1 for Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Figure 2 for Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Figure 3 for Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Figure 4 for Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Abstract:3D Gaussian Splatting has emerged as a powerful representation of geometry and appearance for RGB-only dense Simultaneous Localization and Mapping (SLAM), as it provides a compact dense map representation while enabling efficient and high-quality map rendering. However, existing methods show significantly worse reconstruction quality than competing methods using other 3D representations, e.g. neural points clouds, since they either do not employ global map and pose optimization or make use of monocular depth. In response, we propose the first RGB-only SLAM system with a dense 3D Gaussian map representation that utilizes all benefits of globally optimized tracking by adapting dynamically to keyframe pose and depth updates by actively deforming the 3D Gaussian map. Moreover, we find that refining the depth updates in inaccurate areas with a monocular depth estimator further improves the accuracy of the 3D reconstruction. Our experiments on the Replica, TUM-RGBD, and ScanNet datasets indicate the effectiveness of globally optimized 3D Gaussians, as the approach achieves superior or on par performance with existing RGB-only SLAM methods methods in tracking, mapping and rendering accuracy while yielding small map sizes and fast runtimes. The source code is available at https://github.com/eriksandstroem/Splat-SLAM.

* 21 pages

Via

Access Paper or Ask Questions

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

May 08, 2024

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

Figure 1 for How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 2 for How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 3 for How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 4 for How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Abstract:Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.

* Technical report

Via

Access Paper or Ask Questions

Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

May 06, 2024

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

Figure 1 for Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 2 for Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 3 for Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Figure 4 for Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Abstract:Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, {especially open-source ones,} struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.

* Technical report

Via

Access Paper or Ask Questions