Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shengfeng He

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Nov 15, 2024

Jingru Yang, Huan Yu, Yang Jingxin, Chentianye Xu, Yin Biao, Yu Sun, Shengfeng He

Figure 1 for Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Figure 2 for Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Figure 3 for Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Figure 4 for Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Abstract:Multimodal Large Language Models (MLLMs) excel at descriptive tasks within images but often struggle with precise object localization, a critical element for reliable visual interpretation. In contrast, traditional object detection models provide high localization accuracy but frequently generate detections lacking contextual coherence due to limited modeling of inter-object relationships. To address this fundamental limitation, we introduce the \textbf{Visual-Linguistic Agent (VLA), a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors. In the VLA paradigm, the MLLM serves as a central Linguistic Agent, working collaboratively with specialized Vision Agents for object detection and classification. The Linguistic Agent evaluates and refines detections by reasoning over spatial and contextual relationships among objects, while the classification Vision Agent offers corrective feedback to improve classification accuracy. This collaborative approach enables VLA to significantly enhance both spatial reasoning and object localization, addressing key challenges in multimodal understanding. Extensive evaluations on the COCO dataset demonstrate substantial performance improvements across multiple detection models, highlighting VLA's potential to set a new benchmark in accurate and contextually coherent object detection.

Via

Access Paper or Ask Questions

Lagrangian Motion Fields for Long-term Motion Generation

Sep 03, 2024

Yifei Yang, Zikai Huang, Chenshu Xu, Shengfeng He

Figure 1 for Lagrangian Motion Fields for Long-term Motion Generation

Figure 2 for Lagrangian Motion Fields for Long-term Motion Generation

Figure 3 for Lagrangian Motion Fields for Long-term Motion Generation

Figure 4 for Lagrangian Motion Fields for Long-term Motion Generation

Abstract:Long-term motion generation is a challenging task that requires producing coherent and realistic sequences over extended durations. Current methods primarily rely on framewise motion representations, which capture only static spatial details and overlook temporal dynamics. This approach leads to significant redundancy across the temporal dimension, complicating the generation of effective long-term motion. To overcome these limitations, we introduce the novel concept of Lagrangian Motion Fields, specifically designed for long-term motion generation. By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of "supermotions" (analogous to superpixels). This method seamlessly integrates static spatial information with interpretable temporal dynamics, transcending the limitations of existing network architectures and motion sequence content types. Our solution is versatile and lightweight, eliminating the need for neural network preprocessing. Our approach excels in tasks such as long-term music-to-dance generation and text-to-motion generation, offering enhanced efficiency, superior generation quality, and greater diversity compared to existing methods. Additionally, the adaptability of Lagrangian Motion Fields extends to applications like infinite motion looping and fine-grained controlled motion generation, highlighting its broad utility. Video demonstrations are available at \url{https://plyfager.github.io/LaMoG}.

* 13 pages, 9 figures

Via

Access Paper or Ask Questions

Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

Aug 23, 2024

Yangyang Xu, Wenqi Shao, Yong Du, Haiming Zhu, Yang Zhou, Ping Luo, Shengfeng He

Abstract:Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce \textbf{T}ask-\textbf{O}riented \textbf{D}iffusion \textbf{I}nversion (\textbf{TODInv}), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended \(\mathcal{P}^*\) space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive experiments on benchmark dataset reveal TODInv's superior performance over existing methods, delivering both quantitative and qualitative enhancements while showcasing its versatility with few-step diffusion model.

Via

Access Paper or Ask Questions

G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Aug 18, 2024

Haoxin Yang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Jing Qin, Yi Wang, Pheng-Ann Heng, Shengfeng He

Figure 1 for G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Figure 2 for G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Figure 3 for G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Figure 4 for G2Face: High-Fidelity Reversible Face Anonymization via Generative and Geometric Priors

Abstract:Reversible face anonymization, unlike traditional face pixelization, seeks to replace sensitive identity information in facial images with synthesized alternatives, preserving privacy without sacrificing image clarity. Traditional methods, such as encoder-decoder networks, often result in significant loss of facial details due to their limited learning capacity. Additionally, relying on latent manipulation in pre-trained GANs can lead to changes in ID-irrelevant attributes, adversely affecting data utility due to GAN inversion inaccuracies. This paper introduces G\textsuperscript{2}Face, which leverages both generative and geometric priors to enhance identity manipulation, achieving high-quality reversible face anonymization without compromising data utility. We utilize a 3D face model to extract geometric information from the input face, integrating it with a pre-trained GAN-based decoder. This synergy of generative and geometric priors allows the decoder to produce realistic anonymized faces with consistent geometry. Moreover, multi-scale facial features are extracted from the original face and combined with the decoder using our novel identity-aware feature fusion blocks (IFF). This integration enables precise blending of the generated facial patterns with the original ID-irrelevant features, resulting in accurate identity manipulation. Extensive experiments demonstrate that our method outperforms existing state-of-the-art techniques in face anonymization and recovery, while preserving high data utility. Code is available at https://github.com/Harxis/G2Face.

Via

Access Paper or Ask Questions

VrdONE: One-stage Video Visual Relation Detection

Aug 18, 2024

Xinjie Jiang, Chenxi Zheng, Xuemiao Xu, Bangzhen Liu, Weiying Zheng, Huaidong Zhang, Shengfeng He

Figure 1 for VrdONE: One-stage Video Visual Relation Detection

Figure 2 for VrdONE: One-stage Video Visual Relation Detection

Figure 3 for VrdONE: One-stage Video Visual Relation Detection

Figure 4 for VrdONE: One-stage Video Visual Relation Detection

Abstract:Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between these elements. Addressing the need to recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows for both relation category identification and binary mask generation in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across various frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, enhancing how subjects and objects perceive each other before combining. VrdONE achieves state-of-the-art performances on the VidOR benchmark and ImageNet-VidVRD, showcasing its superior capability in discerning relations across different temporal scales. The code is available at \textcolor[RGB]{228,58,136}{\href{https://github.com/lucaspk512/vrdone}{https://github.com/lucaspk512/vrdone}}.

* 12 pages, 8 figures, accepted by ACM Multimedia 2024

Via

Access Paper or Ask Questions

DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Jul 26, 2024

Yi Lei, Huilin Zhu, Jingling Yuan, Guangli Xiang, Xian Zhong, Shengfeng He

Figure 1 for DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Figure 2 for DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Figure 3 for DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Figure 4 for DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Abstract:Drone-based crowd tracking faces difficulties in accurately identifying and monitoring objects from an aerial perspective, largely due to their small size and close proximity to each other, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It specifically addresses the problem of cross-frame motion to enhance tracking accuracy and dependability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from the visual-language model, integrating appearance with motion cues. The framework utilizes the Hungarian algorithm to ensure the accurate matching of individuals across frames. Demonstrated on DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in scenarios captured by drones.

Via

Access Paper or Ask Questions

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

Jul 10, 2024

Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, Shengfeng He

Abstract:Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike prior approaches, Beat-It uniquely integrates explicit beat awareness and key pose guidance, effectively resolving two main issues: the misalignment of generated dance motions with musical beats, and the inability to map key poses to specific beats, critical for practical choreography. Our approach disentangles beat conditions from music using a nearest beat distance representation and employs a hierarchical multi-condition fusion mechanism. This mechanism seamlessly integrates key poses, beats, and music features, mitigating condition conflicts and offering rich, multi-conditioned guidance for dance generation. Additionally, a specially designed beat alignment loss ensures the generated dance movements remain in sync with the designated beats. Extensive experiments confirm Beat-It's superiority over existing state-of-the-art methods in terms of beat alignment and motion controllability.

* ECCV 2024

Via

Access Paper or Ask Questions

Zero-shot Object Counting with Good Exemplars

Jul 09, 2024

Huilin Zhu, Jingling Yuan, Zhengwei Yang, Yu Guo, Zheng Wang, Xian Zhong, Shengfeng He

Figure 1 for Zero-shot Object Counting with Good Exemplars

Figure 2 for Zero-shot Object Counting with Good Exemplars

Figure 3 for Zero-shot Object Counting with Good Exemplars

Figure 4 for Zero-shot Object Counting with Good Exemplars

Abstract:Zero-shot object counting (ZOC) aims to enumerate objects in images using only the names of object classes during testing, without the need for manual annotations. However, a critical challenge in current ZOC methods lies in their inability to identify high-quality exemplars effectively. This deficiency hampers scalability across diverse classes and undermines the development of strong visual associations between the identified classes and image content. To this end, we propose the Visual Association-based Zero-shot Object Counting (VA-Count) framework. VA-Count consists of an Exemplar Enhancement Module (EEM) and a Noise Suppression Module (NSM) that synergistically refine the process of class exemplar identification while minimizing the consequences of incorrect object identification. The EEM utilizes advanced vision-language pretaining models to discover potential exemplars, ensuring the framework's adaptability to various classes. Meanwhile, the NSM employs contrastive learning to differentiate between optimal and suboptimal exemplar pairs, reducing the negative effects of erroneous exemplars. VA-Count demonstrates its effectiveness and scalability in zero-shot contexts with superior performance on two object counting datasets.

Via

Access Paper or Ask Questions

OneRestore: A Universal Restoration Framework for Composite Degradation

Jul 05, 2024

Yu Guo, Yuan Gao, Yuxu Lu, Huilin Zhu, Ryan Wen Liu, Shengfeng He

Figure 1 for OneRestore: A Universal Restoration Framework for Composite Degradation

Figure 2 for OneRestore: A Universal Restoration Framework for Composite Degradation

Figure 3 for OneRestore: A Universal Restoration Framework for Composite Degradation

Figure 4 for OneRestore: A Universal Restoration Framework for Composite Degradation

Abstract:In real-world scenarios, image impairments often manifest as composite degradations, presenting a complex interplay of elements such as low light, haze, rain, and snow. Despite this reality, existing restoration methods typically target isolated degradation types, thereby falling short in environments where multiple degrading factors coexist. To bridge this gap, our study proposes a versatile imaging model that consolidates four physical corruption paradigms to accurately represent complex, composite degradation scenarios. In this context, we propose OneRestore, a novel transformer-based framework designed for adaptive, controllable scene restoration. The proposed framework leverages a unique cross-attention mechanism, merging degraded scene descriptors with image features, allowing for nuanced restoration. Our model allows versatile input scene descriptors, ranging from manual text embeddings to automatic extractions based on visual attributes. Our methodology is further enhanced through a composite degradation restoration loss, using extra degraded images as negative samples to fortify model constraints. Comparative results on synthetic and real-world datasets demonstrate OneRestore as a superior solution, significantly advancing the state-of-the-art in addressing complex, composite degradations.

Via

Access Paper or Ask Questions

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Apr 01, 2024

Haofeng Liu, Chenshu Xu, Yifei Yang, Lihua Zeng, Shengfeng He

Figure 1 for Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Figure 2 for Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Figure 3 for Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Figure 4 for Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Abstract:Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work, DragDiffusion, updates the diffusion latent map in response to user inputs, causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast, we present DragNoise, offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly, the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly, high-level semantics, established early in the denoising process, show minimal variation in subsequent stages. Leveraging these insights, DragNoise edits diffusion semantics in a single denoising step and efficiently propagates these changes, ensuring stability and efficiency in diffusion editing. Comparative experiments reveal that DragNoise achieves superior control and semantic retention, reducing the optimization time by over 50% compared to DragDiffusion. Our codes are available at https://github.com/haofengl/DragNoise.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions