Weining Ren

Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

May 22, 2025

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

Abstract: This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild

May 29, 2024

Weining Ren, Zihan Zhu, Boyang Sun, Jiaqi Chen, Marc Pollefeys, Songyou Peng

Abstract: Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes, but face challenges in dynamic, real-world environments with distractors like moving objects, shadows, and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality, especially under high occlusion scenarios. In this paper, we introduce NeRF On-the-go, a simple yet effective approach that enables the robust synthesis of novel views in complex, in-the-wild scenes from only casually captured image sequences. Delving into uncertainty, our method not only efficiently eliminates distractors, even when they are predominant in captures, but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes, our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications.

* CVPR 2024, first two authors contributed equally. Project Page: https://nerf-on-the-go.github.io

Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes

Mar 07, 2024

Stamatios Georgoulis, Weining Ren, Alfredo Bochicchio, Daniel Eckert, Yuanyou Li, Abel Gawel

Abstract: Rapid and reliable identification of dynamic scene parts, also known as motion segmentation, is a key challenge for mobile sensors. Contemporary RGB camera-based methods rely on modeling camera and scene properties; however, they are often under-constrained and fall short on unknown categories. Event cameras have the potential to overcome these limitations, but corresponding methods have only been demonstrated in smaller-scale indoor environments with simplified dynamic objects. This work presents an event-based method for class-agnostic motion segmentation that can also be deployed successfully across complex large-scale outdoor environments. To this end, we introduce a novel divide-and-conquer pipeline that combines: (a) ego-motion compensated events, computed via a scene understanding module that predicts monocular depth and camera pose as auxiliary tasks, and (b) optical flow from a dedicated optical flow module. These intermediate representations are then fed into a segmentation module that predicts motion segmentation masks. A novel transformer-based temporal attention module in the segmentation module builds correlations across adjacent 'frames' to get temporally consistent segmentation masks. Our method sets the new state-of-the-art on the classic EV-IMO benchmark (indoors), where we achieve improvements of 2.19 moving object IoU (2.22 mIoU) and 4.52 point IoU respectively, as well as on a newly-generated motion segmentation and tracking benchmark (outdoors) based on the DSEC event dataset, termed DSEC-MOTS, where we show an improvement of 12.91 moving object IoU.

* 3DV 2024, the first two authors contributed equally

3D Textured Shape Recovery with Learned Geometric Priors

Sep 07, 2022

Lei Li, Zhizheng Liu, Weining Ren, Liudi Yang, Fangjinhua Wang, Marc Pollefeys, Songyou Peng

Abstract: 3D textured shape recovery from partial scans is crucial for many real-world applications. Existing approaches have demonstrated the efficacy of implicit function representation, but they suffer from partial inputs with severe occlusions and varying object types, which greatly hinders their application value in the real world. This technical report presents our approach to address these limitations by incorporating learned geometric priors. To this end, we generate a SMPL model from learned pose prediction and fuse it into the partial input to add prior knowledge of human bodies. We also propose a novel completeness-aware bounding box adaptation for handling different levels of scales and partialness of partial scans.

* 5 pages, 3 figures, 2 tables
