Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wenguan Wang

A Survey of World Models for Autonomous Driving

Jan 20, 2025

Tuo Feng, Wenguan Wang, Yi Yang

Figure 1 for A Survey of World Models for Autonomous Driving

Figure 2 for A Survey of World Models for Autonomous Driving

Figure 3 for A Survey of World Models for Autonomous Driving

Figure 4 for A Survey of World Models for Autonomous Driving

Abstract:Recent breakthroughs in autonomous driving have revolutionized the way vehicles perceive and interact with their surroundings. In particular, world models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. Such models unify perception, prediction, and planning, thereby enabling autonomous systems to make rapid, informed decisions under complex and often unpredictable conditions. Research trends span diverse areas, including 4D occupancy prediction and generative data synthesis, all of which bolster scene understanding and trajectory forecasting. Notably, recent works exploit large-scale pretraining and advanced self-supervised learning to scale up models' capacity for rare-event simulation and real-time interaction. In addressing key challenges -- ranging from domain adaptation and long-tail anomaly detection to multimodal fusion -- these world models pave the way for more robust, reliable, and adaptable autonomous driving solutions. This survey systematically reviews the state of the art, categorizing techniques by their focus on future prediction, behavior planning, and the interaction between the two. We also identify potential directions for future research, emphasizing holistic integration, improved computational efficiency, and advanced simulation. Our comprehensive analysis underscores the transformative role of world models in driving next-generation autonomous systems toward safer and more equitable mobility.

* Ongoing project

Via

Access Paper or Ask Questions

Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Oct 26, 2024

Liulei Li, Wenguan Wang, Yi Yang

Figure 1 for Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Figure 2 for Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Figure 3 for Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Figure 4 for Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Abstract:Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle at compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions, and extract HOI-relevant cues from images without heavy fine-tuning. Benefited from above, DIFFUSIONHOI achieves SOTA performance on three datasets under both regular and zero-shot setups.

* NeurIPS 2024

Via

Access Paper or Ask Questions

Scene Graph Generation with Role-Playing Large Language Models

Oct 20, 2024

Guikun Chen, Jin Li, Wenguan Wang

Figure 1 for Scene Graph Generation with Role-Playing Large Language Models

Figure 2 for Scene Graph Generation with Role-Playing Large Language Models

Figure 3 for Scene Graph Generation with Role-Playing Large Language Models

Figure 4 for Scene Graph Generation with Role-Playing Large Language Models

Abstract:Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.

* NeurIPS 2024. Code: https://github.com/guikunchen/SDSGG

Via

Access Paper or Ask Questions

Vision-Language Navigation with Energy-Based Policy

Oct 18, 2024

Rui Liu, Wenguan Wang, Yi Yang

Figure 1 for Vision-Language Navigation with Energy-Based Policy

Figure 2 for Vision-Language Navigation with Energy-Based Policy

Figure 3 for Vision-Language Navigation with Energy-Based Policy

Figure 4 for Vision-Language Navigation with Energy-Based Policy

Abstract:Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.

Via

Access Paper or Ask Questions

Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Sep 16, 2024

Minghan Chen, Guikun Chen, Wenguan Wang, Yi Yang

Figure 1 for Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Figure 2 for Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Figure 3 for Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Figure 4 for Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Abstract:DETR introduces a simplified one-stage framework for scene graph generation (SGG). However, DETR-based SGG models face two challenges: i) Sparse supervision, as each image typically contains fewer than 10 relation annotations, while the models employ over 100 relation queries. This sparsity arises because each ground truth relation is assigned to only one single query during training. ii) False negative samples, since one ground truth relation may have multiple queries with similar matching scores. These suboptimally matched queries are simply treated as negative samples, causing the loss of valuable supervisory signals. As a response, we devise Hydra-SGG, a one-stage SGG method that adopts a new Hybrid Relation Assignment. This assignment combines a One-to-One Relation Assignment with a newly introduced IoU-based One-to-Many Relation Assignment. Specifically, each ground truth is assigned to multiple relation queries with high IoU subject-object boxes. This Hybrid Relation Assignment increases the number of positive training samples, alleviating sparse supervision. Moreover, we, for the first time, empirically show that self-attention over relation queries helps reduce duplicated relation predictions. We, therefore, propose Hydra Branch, a parameter-sharing auxiliary decoder without a self-attention layer. This design promotes One-to-Many Relation Assignment by enabling different queries to predict the same relation. Hydra-SGG achieves state-of-the-art performance with 10.6 mR@20 and 16.0 mR@50 on VG150, while only requiring 12 training epochs. It also sets a new state-of-the-art on Open Images V6 and and GQA.

Via

Access Paper or Ask Questions

Image Segmentation in Foundation Model Era: A Survey

Aug 23, 2024

Tianfei Zhou, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers

Figure 1 for Image Segmentation in Foundation Model Era: A Survey

Figure 2 for Image Segmentation in Foundation Model Era: A Survey

Figure 3 for Image Segmentation in Foundation Model Era: A Survey

Abstract:Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have embarked on a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also herald newfound segmentation capabilities previously unseen in deep learning context. However, current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research -- generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation), and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) -- by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs like CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Subsequently, we engage in a discussion of open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey catalyzes the evolution of advanced image segmentation systems.

* A comprehensive survey of image segmentation in foundation model era (work in progress)

Via

Access Paper or Ask Questions

Navigation Instruction Generation with BEV Perception and Large Language Models

Jul 21, 2024

Sheng Fan, Rui Liu, Wenguan Wang, Yi Yang

Figure 1 for Navigation Instruction Generation with BEV Perception and Large Language Models

Figure 2 for Navigation Instruction Generation with BEV Perception and Large Language Models

Figure 3 for Navigation Instruction Generation with BEV Perception and Large Language Models

Figure 4 for Navigation Instruction Generation with BEV Perception and Large Language Models

Abstract:Navigation instruction generation, which requires embodied agents to describe the navigation routes, has been of great interest in robotics and human-computer interaction. Existing studies directly map the sequence of 2D perspective observations to route descriptions. Though straightforward, they overlook the geometric information and object semantics of the 3D environment. To address these challenges, we propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation. Specifically, BEVInstructor constructs a PerspectiveBEVVisual Encoder for the comprehension of 3D environments through fusing BEV and perspective features. To leverage the powerful language capabilities of MLLMs, the fused representations are used as visual prompts for MLLMs, and perspective-BEV prompt tuning is proposed for parameter-efficient updating. Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner. BEVInstructor achieves impressive performance across diverse datasets (i.e., R2R, REVERIE, and UrbanWalk).

* ECCV 2024; Project Page: https://github.com/FanScy/BEVInstructor

Via

Access Paper or Ask Questions

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Jul 15, 2024

Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

Figure 1 for Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Figure 2 for Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Figure 3 for Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Figure 4 for Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Abstract:Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the utilization of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ the diffusion model as foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM called reverberator and one for dereverberation called dereverberator. The dereverberator judges whether the reverberation audio generated by reverberator sounds like being in the conditional visual scenario, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios.

* ECCV 2024; Project page: https://hechang25.github.io/MVSD

Via

Access Paper or Ask Questions

Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Jul 14, 2024

Tuo Feng, Wenguan Wang, Ruijie Quan, Yi Yang

Figure 1 for Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Figure 2 for Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Figure 3 for Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Figure 4 for Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Abstract:Current 3D self-supervised learning methods of 3D scenes face a data desert issue, resulting from the time-consuming and expensive collecting process of 3D scene data. Conversely, 3D shape datasets are easier to collect. Despite this, existing pre-training strategies on shape data offer limited potential for 3D scene understanding due to significant disparities in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multiscale and high-resolution backbones for shape and scene level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establishes direct paths to highresolution features that capture deep semantic information across multiple scales. This pivotal nature makes them suitable for a wide range of 3D downstream tasks that tightly rely on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) to amalgamate points from various shapes, creating a random pseudo scene (comprising multiple objects) for training data, mitigating disparities between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied for the pre-training of MH-P/V. In PPC, the inherent correspondence (i.e., point pairs) is naturally obtained in S2SS. Extensive experiments have demonstrated the transferability of 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart). MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.

* ECCV 2024; Project page: https://github.com/FengZicai/S2S

Via

Access Paper or Ask Questions

Nonverbal Interaction Detection

Jul 11, 2024

Jianan Wei, Tianfei Zhou, Yi Yang, Wenguan Wang

Figure 1 for Nonverbal Interaction Detection

Figure 2 for Nonverbal Interaction Detection

Figure 3 for Nonverbal Interaction Detection

Figure 4 for Nonverbal Interaction Detection

Abstract:This work addresses a new challenge of understanding human nonverbal interaction in social contexts. Nonverbal signals pervade virtually every communicative act. Our gestures, facial expressions, postures, gaze, even physical appearance all convey messages, without anything being said. Despite their critical role in social life, nonverbal signals receive very limited attention as compared to the linguistic counterparts, and existing solutions typically examine nonverbal cues in isolation. Our study marks the first systematic effort to enhance the interpretation of multifaceted nonverbal signals. First, we contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups, along with 22 atomic-level nonverbal behaviors under five broad interaction types. Second, we establish a new task NVI-DET for nonverbal interaction detection, which is formalized as identifying triplets in the form <individual, group, interaction> from images. Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs. Central to the model is a dual multi-scale hypergraph that adeptly addresses individual-to-individual and group-to-group correlations across varying scales, facilitating interactional feature learning and eventually improving interaction prediction. Extensive experiments on NVI show that NVI-DEHR improves various baselines significantly in NVI-DET. It also exhibits leading performance on HOI-DET, confirming its versatility in supporting related tasks and strong generalization ability. We hope that our study will offer the community new avenues to explore nonverbal signals in more depth.

* ECCV 2024; Project page: https://github.com/weijianan1/NVI

Via

Access Paper or Ask Questions