Ming-Hsuan Yang

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

Oct 31, 2024

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, Songyou Peng

Abstract: We introduce NoPoSplat, a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from unposed sparse multi-view images. Our model, trained exclusively with photometric loss, achieves real-time 3D Gaussian reconstruction during inference. To eliminate the need for accurate pose input during reconstruction, we anchor one input view's local camera coordinates as the canonical space and train the network to predict Gaussian primitives for all views within this space. This approach obviates the need to transform Gaussian primitives from local coordinates into a global coordinate system, thus avoiding errors associated with per-frame Gaussians and pose estimation. To resolve scale ambiguity, we design and compare various intrinsic embedding methods, ultimately opting to convert camera intrinsics into a token embedding and concatenate it with image tokens as input to the model, enabling accurate scene scale prediction. We utilize the reconstructed 3D Gaussians for novel view synthesis and pose estimation tasks and propose a two-stage coarse-to-fine pipeline for accurate pose estimation. Experimental results demonstrate that our pose-free approach can achieve superior novel view synthesis quality compared to pose-required methods, particularly in scenarios with limited input image overlap. For pose estimation, our method, trained without ground truth depth or explicit matching loss, significantly outperforms the state-of-the-art methods with substantial improvements. This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios. Code and trained models are available at https://noposplat.github.io/.

* Project page: https://noposplat.github.io/
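
The NoPoSplat abstract mentions converting camera intrinsics into a token and concatenating it with the image tokens. The sketch below illustrates that idea only in outline. The abstract does not specify the embedding layer, so the single linear projection, the normalization by image size, the token width `dim`, and the helper names `intrinsics_to_token` and `append_intrinsic_token` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def intrinsics_to_token(fx, fy, cx, cy, width, height, weights, bias):
    """Project pinhole intrinsics to one token with a linear layer (assumed design)."""
    # Normalize by image size so the features do not depend on resolution (assumption).
    feats = [fx / width, fy / height, cx / width, cy / height]
    # token[i] = sum_j weights[i][j] * feats[j] + bias[i]
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]


def append_intrinsic_token(image_tokens, intrinsic_token):
    """Concatenate along the sequence axis: N image tokens become N + 1 tokens."""
    return image_tokens + [intrinsic_token]


dim = 8                                   # hypothetical token width
rng = random.Random(0)                    # stand-in for learned parameters
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(dim)]
bias = [0.0] * dim

image_tokens = [[0.0] * dim for _ in range(16)]   # placeholder patch tokens
token = intrinsics_to_token(500.0, 500.0, 320.0, 240.0, 640, 480, weights, bias)
sequence = append_intrinsic_token(image_tokens, token)
print(len(sequence), len(sequence[-1]))   # 17 8
```

Because the focal length enters the token directly, a transformer consuming `sequence` has the information it needs to relate pixel displacement to metric scale, which is the motivation the abstract gives for the design.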

Layout-your-3D: Controllable and Precise 3D Generation with 2D Blueprint

Oct 20, 2024

Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang

Abstract: We present Layout-Your-3D, a framework that allows controllable and compositional 3D generation from text prompts. Existing text-to-3D methods often struggle to generate assets with plausible object interactions or require tedious optimization processes. To address these challenges, our approach leverages 2D layouts as a blueprint to facilitate precise and plausible control over 3D generation. Starting with a 2D layout provided by a user or generated from a text description, we first create a coarse 3D scene using a carefully designed initialization process based on efficient reconstruction models. To enforce coherent global 3D layouts and enhance the quality of instance appearances, we propose a collision-aware layout optimization process followed by instance-wise refinement. Experimental results demonstrate that Layout-Your-3D yields more reasonable and visually appealing compositional 3D assets while significantly reducing the time required for each prompt. Additionally, Layout-Your-3D can be easily applied to downstream tasks, such as 3D editing and object insertion. Our project page is available at: https://colezwhy.github.io/layoutyour3d/

* 21 pages, 17 figures

OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Oct 16, 2024

Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang (+1 more)

Abstract: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models (OLMs), such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. In particular, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single-modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1) a synthetic subset: a synthetic dataset generated automatically by translating text into multiple modalities: audio, images, video, and hybrids (Omnify). (2) a realistic subset: a real-world dataset, manually curated and annotated by experts, for evaluating cross-modal reasoning in natural settings. OmnixR presents a unique evaluation towards assessing OLMs over a diverse mix of modalities, such as a question that involves video, audio, and text, providing a rigorous cross-modal reasoning testbed unlike any existing benchmarks. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior, underscoring the challenges of omni-modal AI alignment.

* 19 pages, 6 figures, 12 tables

KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

Oct 15, 2024

Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su (+1 more)

Abstract: Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities, a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance the fidelity of entities by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.

* Project page: https://kitten-project.github.io/

A Simple Approach to Unifying Diffusion-based Conditional Generation

Oct 15, 2024

Xirui Li, Charles Herrmann, Kelvin C. K. Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang

Abstract: Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g. image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g. depth to image), estimation (e.g. image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce significant complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simple formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce results comparable to specialized methods and better than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.

* Project page: https://lixirui142.github.io/unicon-diffusion/

Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models

Oct 14, 2024

Jingzhi Bao, Xueting Li, Ming-Hsuan Yang

Abstract: 3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use, playing a crucial role in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for mesh sequences remains labor-intensive for professional artists. On the other hand, while video diffusion models excel at text-driven video generation, they often lack 3D geometry awareness and struggle with achieving multi-view consistent texturing for 3D meshes. In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views through latent aggregation in the UV space. To ensure temporal consistency, we leverage prior knowledge from a conditional video generation model for texture synthesis. However, straightforwardly combining the video diffusion model and the UV texture aggregation leads to blurry results. We analyze the underlying causes and propose a simple yet effective modification to the DDIM sampling process to address this issue. Additionally, we introduce a reference latent texture to strengthen the correlation between frames during the denoising process. To the best of our knowledge, Tex4D is the first method specifically designed for 4D scene texturing. Extensive experiments demonstrate its superiority in producing multi-view and multi-frame consistent videos based on untextured mesh sequences.

* Project page: https://tex4d.github.io/

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Oct 07, 2024

Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang

Abstract: Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformer (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full, factorized, and interleaved spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer.

* 15 pages, 7 figures
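
The PredFormer abstract reports relative MSE reductions and before/after FPS figures. The short calculation below shows how those two kinds of numbers are read, using only the FPS values quoted in the abstract. The function names are illustrative, and the baseline MSE of 100.0 is a hypothetical placeholder, since the abstract gives only percentages and not absolute MSE values.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


def value_after_reduction(baseline, reduction):
    """Value that results from lowering `baseline` by the given fraction."""
    return baseline * (1.0 - reduction)


def speedup(baseline_fps, new_fps):
    """Throughput ratio of the new model over the baseline."""
    return new_fps / baseline_fps


# FPS figures quoted in the abstract.
taxibj_speedup = speedup(533, 2364)
weatherbench_speedup = speedup(196, 404)
print(f"TaxiBJ speedup: {taxibj_speedup:.2f}x")              # 4.44x
print(f"WeatherBench speedup: {weatherbench_speedup:.2f}x")  # 2.06x

# Hypothetical baseline MSE, only to show what a 51.3% reduction means.
hypothetical_baseline_mse = 100.0
reduced = value_after_reduction(hypothetical_baseline_mse, 0.513)
print(f"MSE after a 51.3% reduction: {reduced:.1f}")         # 48.7
```

A 51.3% reduction therefore leaves slightly less than half of the baseline error, and the quoted FPS figures correspond to roughly 4.4x and 2.1x throughput gains.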

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Oct 04, 2024

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang

Abstract: Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction.

* Project page: https://monst3r-project.github.io/
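
The MonST3R abstract is built around estimating a pointmap for each timestep. As a rough illustration of the data structure only, a pointmap stores one 3D point per pixel. The sketch below builds one by unprojecting a depth map through a pinhole camera. This is an assumption made for illustration: according to the abstract the model estimates pointmaps directly, and the function name `depth_to_pointmap` and the toy intrinsics are hypothetical.

```python
def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Unproject an H x W depth map into an H x W grid of (x, y, z) points.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    pointmap = []
    for v, row in enumerate(depth):          # v indexes image rows
        out_row = []
        for u, z in enumerate(row):          # u indexes image columns
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


# Toy 2 x 2 depth map with hypothetical intrinsics.
depth = [[2.0, 2.0],
         [2.0, 2.0]]
pointmap = depth_to_pointmap(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pointmap[0][0])   # (-1.0, -1.0, 2.0)
print(pointmap[1][1])   # (1.0, 1.0, 2.0)
```

A video then yields one such grid per frame, which is the per-timestep representation the abstract refers to.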

LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation

Sep 09, 2024

Henghui Ding, Lingyi Hong, Chang Liu, Ning Xu, Linjie Yang, Yuchen Fan, Deshui Miao, Yameng Gu, Xin Li, Zhenyu He (+23 more)

Abstract: Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge, held in conjunction with the ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube-VOS and YouTube-RVOS benchmarks with the latest datasets, MOSE, LVOS, and MeViS, to assess VOS under more challenging, complex environments. This year's challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes the challenge and dataset introduction, and the methods used by the top 7 teams in the two tracks. More details can be found on our homepage https://lsvos.github.io/.

* ECCV 2024 LSVOS Challenge Report: https://lsvos.github.io/

Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

Aug 29, 2024

Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang

Abstract: Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions. To address these challenges, the MOSE dataset aims to enhance object recognition and differentiation in complex environments, while the LVOS dataset focuses on segmenting objects exhibiting long-term, intricate movements. This report introduces a discriminative spatial-temporal VOS model that utilizes discriminative object features as query representations. The semantic understanding of spatial-semantic modules enables it to recognize object parts, while salient features highlight more distinctive object characteristics. Our model, trained on extensive VOS datasets, achieved first place (80.90% J&F) on the test set of the 6th LSVOS challenge in the VOS Track, demonstrating its effectiveness in tackling the aforementioned challenges. The code will be available at https://github.com/yahooo-m/VOS-Solution.

* 1st Place Solution for 6th LSVOS VOS Track. arXiv admin note: substantial text overlap with arXiv:2406.04600
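
The score in this abstract is reported as J&F. In the standard VOS evaluation convention, J is the region similarity (intersection over union of the predicted and ground-truth masks), F is a contour F-measure, and J&F is their mean. The sketch below shows that arithmetic on a toy mask. The boundary matching that produces contour precision and recall is omitted, so the precision and recall values passed in are hypothetical, and treating two empty masks as a perfect match is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def f_measure(precision, recall):
    """Contour accuracy F: harmonic mean of boundary precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(j, f):
    """Combined score: the mean of J and F."""
    return (j + f) / 2


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = jaccard(pred, gt)                        # 2 shared pixels / 4 in the union
f = f_measure(precision=0.8, recall=0.8)     # hypothetical boundary statistics
score = j_and_f(j, f)
print(f"J = {j:.2f}, F = {f:.2f}, J&F = {score:.2f}")
```

In a benchmark the per-frame values are averaged over frames and objects before being reported as a single percentage.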
