Wengang Zhou

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

Sep 17, 2024

Weiye Xu, Min Wang, Wengang Zhou, Houqiang Li

Abstract: Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. First, natural language instructions often lack explicit task planning. Second, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. Compared to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves from the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.
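
The iterative loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real text encoder, `run_episode` is a hypothetical callback for the LLM planner plus environment, and appending each round's experiences to the database is one plausible reading of "progressively update".

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumed stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(database, task, situation, k=2):
    """Rank stored experiences by task similarity plus situation similarity."""
    q_task, q_sit = embed(task), embed(situation)
    ranked = sorted(
        database,
        key=lambda e: cosine(q_task, embed(e["task"]))
        + cosine(q_sit, embed(e["situation"])),
        reverse=True,
    )
    return ranked[:k]


def progressive_rounds(tasks, run_episode, rounds=2):
    """Each round retrieves from the database built so far, then extends it."""
    database = []
    for _ in range(rounds):
        new_entries = []
        for task, situation in tasks:
            references = retrieve(database, task, situation)
            # run_episode is hypothetical: it would prompt the LLM with the
            # references and return the resulting trajectory and outcome.
            trajectory, success = run_episode(task, situation, references)
            new_entries.append(
                {"task": task, "situation": situation,
                 "trajectory": trajectory, "success": success}
            )
        # Assumption: experiences accumulate across rounds.
        database = database + new_entries
    return database
```

In the first round the database is empty, so the planner acts without references; from the second round on, each task can draw on earlier trajectories, with no ground-truth demonstrations involved.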

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Aug 30, 2024

Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

Abstract: Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates distortion effects that arise from resizing images to a uniform resolution and dynamically optimizes the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to 1008 × 1008. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at https://github.com/harrytea/AdaptVision.
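
To make the idea of size- and aspect-ratio-dependent partitioning concrete, here is a minimal sketch. The tile size of 336 pixels and the cap of 3 tiles per side are assumptions chosen only because 3 × 336 matches the stated 1008 × 1008 maximum; the actual partitioning rule used by AdaptVision may differ.

```python
import math


def choose_grid(width, height, tile=336, max_tiles_per_side=3):
    """Pick a (rows, cols) tile grid from the image size (hypothetical rule)."""
    cols = min(max(math.ceil(width / tile), 1), max_tiles_per_side)
    rows = min(max(math.ceil(height / tile), 1), max_tiles_per_side)
    return rows, cols


def partition(width, height, tile=336, max_tiles_per_side=3):
    """Return the resize target and the crop boxes (left, top, right, bottom)."""
    rows, cols = choose_grid(width, height, tile, max_tiles_per_side)
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return (cols * tile, rows * tile), boxes


if __name__ == "__main__":
    # A wide image gets more columns than rows, so it is not squashed
    # into a square before being tiled.
    print(choose_grid(1000, 400))
```

Under this rule a small natural image maps to a single tile and therefore fewer visual tokens, while a large document page is split into up to nine tiles.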

LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation

Aug 25, 2024

Keyi Zhou, Li Li, Wengang Zhou, Yonghui Wang, Hao Feng, Houqiang Li

Abstract: In video lane detection, there are rich temporal contexts among successive frames, which are under-explored in existing lane detectors. In this work, we propose LaneTCA to bridge the individual video frames and explore how to effectively aggregate the temporal context. Technically, we develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term temporal context, respectively. The accumulative attention module continuously accumulates visual information during the journey of a vehicle, while the adjacent attention module propagates this lane information from the previous frame to the current frame. The two modules are meticulously designed based on the transformer architecture. Finally, these long-short context features are fused with the current frame features to predict the lane lines in the current frame. Extensive quantitative and qualitative experiments are conducted on two prevalent benchmark datasets. The results demonstrate the effectiveness of our method, achieving several new state-of-the-art records. The code and models are available at https://github.com/Alex-1337/LaneTCA

Scaling up Multimodal Pre-training for Sign Language Understanding

Aug 16, 2024

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li

Abstract: Sign language serves as the primary means of communication for the deaf-mute community. Unlike spoken language, it commonly conveys information through the combination of manual features, i.e., hand gestures and body movements, and non-manual features, i.e., facial expressions and mouth cues. To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied in recent years, including isolated/continuous sign language recognition (ISLR/CSLR), gloss-free sign language translation (GF-SLT) and sign language retrieval (SL-RT). Sign language recognition and translation aim to understand the semantic meaning conveyed by sign languages at the gloss level and sentence level, respectively. In contrast, SL-RT focuses on retrieving sign videos or corresponding texts from a closed set under the query-by-example search paradigm. These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representations of sign language videos. To advance the development of sign language understanding, exploring a generalized model that is applicable across various SLU tasks is a profound research direction.

* Sign language recognition; Sign language translation; Sign language retrieval

SwinShadow: Shifted Window for Ambiguous Adjacent Shadow Detection

Aug 07, 2024

Yonghui Wang, Shaokai Liu, Li Li, Wengang Zhou, Houqiang Li

Abstract: Shadow detection is a fundamental and challenging task in many computer vision applications. Intuitively, most shadows come from the occlusion of light by the object itself, resulting in the object and its shadow being contiguous (referred to as the adjacent shadow in this paper). In this case, when the color of the object is similar to that of the shadow, existing methods struggle to achieve accurate detection. To address this problem, we present SwinShadow, a transformer-based architecture that fully utilizes the powerful shifted window mechanism for detecting adjacent shadows. The mechanism operates in two steps. Initially, it applies local self-attention within a single window, enabling the network to focus on local details. Subsequently, it shifts the attention windows to facilitate inter-window attention, enabling the capture of a broader range of adjacent information. These combined steps significantly improve the network's capacity to distinguish shadows from nearby objects. The whole process can be divided into three parts: encoder, decoder, and feature integration. During encoding, we adopt Swin Transformer to acquire hierarchical features. Then, during decoding, for shallow layers, we propose a deep supervision (DS) module to suppress false positives and boost the representation capability of shadow features for subsequent processing, while for deep layers, we leverage a double attention (DA) module to integrate local and shifted windows in one stage to achieve a larger receptive field and enhance the continuity of information. Ultimately, a new multi-level aggregation (MLA) mechanism is applied to fuse the decoded features for mask prediction. Extensive experiments on three shadow detection benchmark datasets, SBU, UCF, and ISTD, demonstrate that our network achieves good performance in terms of balance error rate (BER).
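
For readers unfamiliar with the balance error rate, the sketch below computes it from flattened binary masks using the definition commonly used in shadow detection, BER = (1 - 0.5 × (TP/(TP+FN) + TN/(TN+FP))) × 100. The function name and the list-based mask representation are illustrative choices, and the handling of an empty class is an assumption rather than part of any benchmark protocol.

```python
def balance_error_rate(pred, target):
    """BER in percent for two equal-length sequences of 0/1 labels (1 = shadow)."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, target):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fp += 1
    # Assumption: if a class is absent from the ground truth, count its
    # accuracy as 1.0 so that it does not penalize the score.
    shadow_acc = tp / (tp + fn) if (tp + fn) else 1.0
    non_shadow_acc = tn / (tn + fp) if (tn + fp) else 1.0
    return 100.0 * (1.0 - 0.5 * (shadow_acc + non_shadow_acc))
```

Lower is better: a perfect mask scores 0, and because shadow and non-shadow accuracy are averaged, the score is not dominated by the usually much larger non-shadow region.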

SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

Jul 23, 2024

Longtao Jiang, Min Wang, Zecheng Li, Yao Fang, Wengang Zhou, Houqiang Li

Abstract: Different from traditional video retrieval, sign language retrieval is more biased towards understanding the semantic information of human actions contained in video clips. Previous works typically only encode RGB videos to obtain high-level semantic features, resulting in local action details being drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training, and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features with similar semantic information from intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework only contains learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.

* Accepted to ACM International Conference on Multimedia (MM) 2024

Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis

Jul 07, 2024

Qi Sun, Hang Zhou, Wengang Zhou, Li Li, Houqiang Li

Abstract:Synthesizing realistic 3D indoor scenes is a challenging task that traditionally relies on manual arrangement and annotation by expert designers. Recent advances in autoregressive models have automated this process, but they often lack semantic understanding of the relationships and hierarchies present in real-world scenes, yielding limited performance. In this paper, we propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem. Forest2Seq organizes the inherently unordered collection of scene objects into structured, ordered hierarchical scene trees and forests. By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively. Experimental results on standard benchmarks demonstrate Forest2Seq's superiority in synthesizing more realistic scenes compared to top-performing baselines, with significant improvements in FID and KL scores. Our additional experiments for downstream tasks and ablation studies also confirm the importance of incorporating order as a prior in 3D scene generation.

* ECCV 2024
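
The breadth-first ordering mentioned in the Forest2Seq abstract can be illustrated with a small sketch. The object names, the dictionary-based tree representation, and the choice to traverse all trees level by level from a shared queue are assumptions made for illustration; how the paper builds the trees (its clustering step) and orders them is not reproduced here.

```python
from collections import deque


def forest_to_sequence(roots, children):
    """Flatten a forest into one ordered sequence by breadth-first traversal.

    roots: list of root object names (one per tree).
    children: dict mapping an object name to its ordered list of children.
    """
    order = []
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        order.append(node)
        # Children join the back of the queue, so parents always
        # appear in the sequence before the objects that depend on them.
        queue.extend(children.get(node, []))
    return order


if __name__ == "__main__":
    scene = {"bed": ["nightstand", "lamp"], "desk": ["chair"]}
    print(forest_to_sequence(["bed", "desk"], scene))
```

An ordering like this gives an autoregressive model a consistent convention in which anchor furniture is generated before the items placed around it.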

RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation

Jun 27, 2024

Zhaokang Liao, Hao Feng, Shaokai Liu, Wengang Zhou, Houqiang Li

Abstract: Fisheye images are categorized into central and deviated based on the optical center position. Existing rectification methods are limited to central fisheye images, while this paper proposes a novel method that extends to deviated fisheye image rectification. The challenge lies in the variant global distortion distribution pattern caused by the random optical center position. To address this challenge, we propose a distortion vector map (DVM) that measures the degree and direction of local distortion. By learning the DVM, the model can independently identify local distortions at each pixel without relying on global distortion patterns. The model adopts a pre-training and fine-tuning training paradigm. In the pre-training stage, it predicts the distortion vector map and perceives the local distortion features of each pixel. In the fine-tuning stage, it predicts a pixel-wise flow map for deviated fisheye image rectification. We also propose a data augmentation method mixing central, deviated, and distortion-free images. Such data augmentation improves the model's performance in rectifying both central and deviated fisheye images, compared with models trained on single-type fisheye images. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.

Text-Animator: Controllable Visual Text Video Generation

Jun 25, 2024

Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within Text-to-Video (T2V) generation is the effective visualization of text within generated videos. Despite the progress achieved in T2V generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summarizing semantic scene information, understanding, and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques into the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. Besides, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach in the accuracy of generated visual text over state-of-the-art video generation methods. The project page can be found at https://laulampaul.github.io/text-animator.html.

* Project Page: https://laulampaul.github.io/text-animator.html

Semi-Supervised Spoken Language Glossification

Jun 12, 2024

Huijie Yao, Wengang Zhou, Hao Zhou, Houqiang Li

Abstract: Spoken language glossification (SLG) aims to translate the spoken language text into the sign language gloss, i.e., a written record of sign language. In this work, we present a framework named Semi-Supervised Spoken Language Glossification ($S^3$LG) for SLG. To tackle the bottleneck of limited parallel data in SLG, our $S^3$LG incorporates large-scale monolingual spoken language text into SLG training. The proposed framework follows the self-training structure that iteratively annotates and learns from pseudo labels. Considering the lexical similarity and syntactic difference between sign language and spoken language, our $S^3$LG adopts both the rule-based heuristic and model-based approach for auto-annotation. During training, we randomly mix these complementary synthetic datasets and mark their differences with a special token. As the synthetic data may be of lower quality, the $S^3$LG further leverages consistency regularization to reduce the negative impact of noise in the synthetic data. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of the $S^3$LG. Our code is available at https://github.com/yaohj11/S3LG.

* Accepted to ACL2024 main
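
The step of mixing the two kinds of synthetic data and marking their origin with a special token might look roughly like the sketch below. The token strings `<rule>` and `<model>`, the function name, and the example sentence pairs are all hypothetical; the actual tokens and data pipeline are those in the linked repository.

```python
import random


def mix_synthetic(rule_based, model_based, seed=0):
    """Tag each (text, gloss) pair with its annotation source, then shuffle.

    rule_based / model_based: lists of (spoken_text, pseudo_gloss) pairs
    produced by the two auto-annotation approaches.
    """
    tagged = [("<rule> " + text, gloss) for text, gloss in rule_based]
    tagged += [("<model> " + text, gloss) for text, gloss in model_based]
    # A seeded generator keeps the mixing reproducible between runs.
    random.Random(seed).shuffle(tagged)
    return tagged


if __name__ == "__main__":
    rule_pairs = [("i am going home", "I GO HOME")]
    model_pairs = [("the weather is nice today", "TODAY WEATHER NICE")]
    for text, gloss in mix_synthetic(rule_pairs, model_pairs):
        print(text, "->", gloss)
```

Prefixing the source token lets a single model learn from both pseudo-labelled sets while still being able to tell them apart.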
