Houqiang Li

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

Sep 02, 2023
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang


In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited in effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks in text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which existing approaches lack. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on our contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc achieves state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
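
A minimal sketch of what "unified instruction tuning" over these four tasks could look like: one annotated image yields one instruction-following sample per task. The prompt wording, field names, and answer formats below are illustrative assumptions, not taken from the UniDoc paper.

```python
# Illustrative only: detection, recognition, spotting, and understanding samples
# sharing one instruction-following format, so a single model is tuned on all of them.
def build_samples(image_path, words):
    """words: list of (text, (x1, y1, x2, y2)) annotations for one image."""
    boxes = [b for _, b in words]
    texts = [t for t, _ in words]
    return [
        {   # text detection: answer with box coordinates only
            "image": image_path,
            "instruction": "Detect all text regions and output their bounding boxes.",
            "response": "; ".join(str(b) for b in boxes),
        },
        {   # text recognition: answer with the transcriptions only
            "image": image_path,
            "instruction": "Read all the text in the image.",
            "response": " ".join(texts),
        },
        {   # text spotting: detection and recognition jointly
            "image": image_path,
            "instruction": "Locate every word and transcribe it.",
            "response": "; ".join(f"{b} -> {t}" for t, b in words),
        },
        {   # understanding: free-form QA about the text-rich image
            "image": image_path,
            "instruction": "What is this document about?",
            "response": "<human-written answer>",
        },
    ]

samples = build_samples("doc_0001.png", [("Invoice", (40, 20, 220, 60)), ("Total", (40, 400, 130, 430))])
```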


Sign Language Translation with Iterative Prototype

Aug 23, 2023
Huijie Yao, Wengang Zhou, Hao Feng, Hezhen Hu, Hao Zhou, Houqiang Li


This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). Our IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video through iterative refinement. Our idea mimics the behavior of human reading, where a sentence can be digested repeatedly until an accurate understanding is reached. Technically, IP-SLT consists of feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype based on the visual feature extracted by the feature extraction module. Then, the iterative refinement module leverages the cross-attention mechanism to polish the previous prototype by aggregating it with the original video feature. Through repeated refinement, the prototype finally converges to a more stable and accurate state, leading to a fluent and appropriate translation. In addition, to leverage the sequential dependence of prototypes, we further propose an iterative distillation loss to compress the knowledge of the final iteration into previous ones. As the autoregressive decoding process is executed only once in inference, our IP-SLT is ready to improve various SLT systems with acceptable overhead. Extensive experiments on public benchmarks demonstrate the effectiveness of IP-SLT.
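
A minimal PyTorch sketch of the iterative prototype refinement idea: a prototype is repeatedly polished by cross-attending to the original video features, and an iterative distillation loss pulls earlier prototypes toward the final one. Dimensions, the number of iterations, and the exact loss form are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeRefiner(nn.Module):
    def __init__(self, dim=512, num_heads=8, iters=3):
        super().__init__()
        self.iters = iters
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.init_proto = nn.Linear(dim, dim)  # prototype initialization from pooled video feature

    def forward(self, video_feat):             # video_feat: (B, T, dim)
        proto = self.init_proto(video_feat.mean(dim=1, keepdim=True))  # (B, 1, dim)
        protos = []
        for _ in range(self.iters):
            # polish the previous prototype by aggregating the original video features
            attn_out, _ = self.cross_attn(query=proto, key=video_feat, value=video_feat)
            proto = proto + attn_out
            proto = proto + self.ffn(proto)
            protos.append(proto)
        return protos

def iterative_distillation_loss(protos):
    # compress knowledge of the final iteration into earlier ones (assumed MSE form)
    target = protos[-1].detach()
    return sum(F.mse_loss(p, target) for p in protos[:-1]) / max(len(protos) - 1, 1)

refiner = PrototypeRefiner()
protos = refiner(torch.randn(2, 64, 512))
loss = iterative_distillation_loss(protos)
```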

* Accepted by ICCV 2023 

SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning

Aug 17, 2023
Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li


In fisheye images, distinct distortion patterns are regularly distributed across the image plane. These distortion patterns are independent of the visual content and provide informative cues for rectification. To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and design a unified distortion-aware pretext task for their learning. The transfer performance on the downstream rectification task is remarkably boosted, which verifies the effectiveness of the learned representations. Extensive experiments are conducted, and the quantitative and qualitative results demonstrate the superiority of our method over state-of-the-art algorithms as well as its strong generalization ability on real-world fisheye images.
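
A rough sketch of the patch-level pretext-task idea, assuming distortion labels derived from each patch's radial position (fisheye distortion mainly depends on the distance to the image center). The patch size, number of radial bins, and encoder configuration are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def radial_patch_labels(img_size=256, patch=32, num_bins=8):
    """Assign every patch a distortion-pattern label from its distance to the image center."""
    n = img_size // patch
    centers = torch.arange(n) * patch + patch / 2
    yy, xx = torch.meshgrid(centers, centers, indexing="ij")
    r = torch.sqrt((xx - img_size / 2) ** 2 + (yy - img_size / 2) ** 2)
    bins = (r / r.max() * (num_bins - 1)).long()
    return bins.flatten()                      # (num_patches,)

class DistortionPretext(nn.Module):
    def __init__(self, dim=256, patch=32, num_bins=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.head = nn.Linear(dim, num_bins)   # predict each patch's distortion pattern

    def forward(self, x):                      # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        return self.head(self.encoder(tokens))                # (B, N, num_bins)

model = DistortionPretext()
logits = model(torch.randn(2, 3, 256, 256))
labels = radial_patch_labels().expand(2, -1)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8), labels.reshape(-1))
```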

* Accepted to ICCV 2023 

Text-Only Training for Visual Storytelling

Aug 17, 2023
Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li


Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.
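
A simplified sketch of a training-free visual condition planner: per-image CLIP embeddings (local content) are blended with the sequence-level mean (global context) and smoothed over time before conditioning the text-trained generator. The blending weights and smoothing scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def plan_visual_conditions(clip_embeds, alpha=0.6, beta=0.3):
    """clip_embeds: (N, D) CLIP image embeddings for the N-image sequence."""
    global_ctx = clip_embeds.mean(dim=0, keepdim=True)            # global story context
    conditions = []
    prev = torch.zeros_like(global_ctx)
    for i in range(clip_embeds.size(0)):
        local = clip_embeds[i:i + 1]                              # current image content
        cond = alpha * local + (1 - alpha) * global_ctx           # balance local vs. global
        cond = beta * prev + (1 - beta) * cond                    # respect temporal structure
        conditions.append(cond)
        prev = cond
    return torch.cat(conditions, dim=0)                           # (N, D) conditions

conds = plan_visual_conditions(torch.randn(5, 512))               # one condition per story sentence
```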

* ACM MM 2023 

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Aug 16, 2023
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan


Controllable video generation has gained significant attention in recent years. However, two main limitations persist. Firstly, most existing works focus on either text-, image-, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three components: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) module to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos that follow the trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control of video generation. The project homepage is https://www.microsoft.com/en-us/research/project/dragnuwa/.
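
A toy sketch of how a user-drawn trajectory might be rasterized into maps at multiple resolutions so a diffusion UNet could consume them at different feature scales. The map format, scales, and sampling scheme are assumptions for illustration only.

```python
import torch

def trajectory_to_multiscale_maps(points, frame_hw=(64, 64), scales=(1, 2, 4)):
    """points: list of (t, x, y) with x, y in [0, 1]; returns one map stack per scale."""
    T = max(t for t, _, _ in points) + 1
    maps = []
    for s in scales:
        h, w = frame_hw[0] // s, frame_hw[1] // s
        m = torch.zeros(T, 1, h, w)
        for t, x, y in points:
            m[t, 0, min(int(y * h), h - 1), min(int(x * w), w - 1)] = 1.0  # mark drag location
        maps.append(m)             # coarser maps would feed coarser (more semantic) UNet levels
    return maps

maps = trajectory_to_multiscale_maps([(0, 0.2, 0.5), (1, 0.3, 0.5), (2, 0.45, 0.55)])
```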


Masked Motion Predictors are Strong 3D Action Representation Learners

Aug 14, 2023
Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang, Houqiang Li


In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task of masked self-component reconstruction in human joints, explicit contextual motion modeling is key to learning effective feature representations for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an empirical semantic richness prior that guides the masking process, promoting better attention to semantically rich temporal regions. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of our MAMP is available at https://github.com/maoyunyao/MAMP.
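
A condensed sketch of the two ingredients described above: (1) the prediction target is joint motion (the temporal difference) rather than the joints themselves, and (2) motion magnitude serves as a prior over which tokens to mask. Treating higher-motion tokens as more likely to be masked is an assumption of this sketch.

```python
import torch

def motion_targets(skeleton):                  # skeleton: (T, J, C) joint coordinates
    return skeleton[1:] - skeleton[:-1]        # (T-1, J, C) temporal motion

def motion_guided_mask(skeleton, mask_ratio=0.75):
    motion = motion_targets(skeleton)
    score = motion.norm(dim=-1)                               # (T-1, J) motion magnitude
    score = torch.cat([score[:1], score], dim=0).flatten()    # pad to (T, J), then flatten
    probs = torch.softmax(score, dim=0)                       # higher motion -> higher mask prob
    num_mask = int(mask_ratio * probs.numel())
    idx = torch.multinomial(probs, num_mask, replacement=False)
    mask = torch.zeros(probs.numel(), dtype=torch.bool)
    mask[idx] = True                                          # True = masked token
    return mask.view(skeleton.shape[0], skeleton.shape[1])

skel = torch.randn(120, 25, 3)                 # e.g. 120 frames, 25 joints, xyz
targets = motion_targets(skel)                 # regression target for masked positions
mask = motion_guided_mask(skel)
```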

* To appear in ICCV 2023 

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

Aug 11, 2023
Yufei Yin, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li


Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of the weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals among their neighboring ones, thus benefiting the subsequent pseudo-labeling. Extensive experiments on the prevalent PASCAL VOC 2007 & 2012 and COCO datasets demonstrate the superior performance of our CBL framework. Code will be available at https://github.com/Yinyf0804/WSOD-CBL/.
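
A schematic sketch of the two mechanisms mentioned: an exponential-moving-average teacher blended from the refinement branches, and a ranking-style distillation that pushes the MIDN to score proposals consistently with the teacher. The exact branch weighting and loss form are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, students, branch_weights, momentum=0.999):
    """Blend several refinement branches (same architecture) into one teacher, then EMA-update it."""
    for name, t_param in teacher.named_parameters():
        blended = sum(w * dict(s.named_parameters())[name] for s, w in zip(students, branch_weights))
        t_param.mul_(momentum).add_(blended, alpha=1 - momentum)

def ranking_distillation(student_scores, teacher_scores, tau=1.0):
    """Class-specific scores over the same set of proposals: (num_proposals,) each."""
    t = F.softmax(teacher_scores / tau, dim=0)          # teacher's soft ranking over proposals
    s = F.log_softmax(student_scores / tau, dim=0)
    return F.kl_div(s, t, reduction="sum")              # student mimics the teacher's ordering

loss = ranking_distillation(torch.randn(300), torch.randn(300))
```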

* Accepted by ICCV 2023 

Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video

Aug 08, 2023
Weichao Zhao, Hezhen Hu, Wengang Zhou, Li Li, Houqiang Li


Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling the physically plausible relation between the two hands, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly exploiting spatial-temporal information to achieve better interacting hand reconstruction. On the one hand, we leverage temporal context to complement the insufficient information provided by a single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we further propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments validate the effectiveness of our proposed framework, which achieves new state-of-the-art performance on public benchmarks.
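
A toy sketch of the two constraints mentioned above: a temporal smoothness term on per-frame hand poses and a sphere-based interpenetration penalty between the two hands. Approximating hand geometry with per-joint spheres of a fixed radius is an assumption of this sketch, not the paper's detection module.

```python
import torch

def temporal_smoothness(joints):               # joints: (T, J, 3) per-frame 3D joints
    vel = joints[1:] - joints[:-1]
    acc = vel[1:] - vel[:-1]
    return vel.pow(2).mean() + acc.pow(2).mean()   # penalize jittery motion

def interpenetration_penalty(left, right, radius=0.008):
    """left, right: (J, 3) joints of each hand; penalize overlapping joint spheres."""
    dist = torch.cdist(left, right)                       # (J, J) pairwise distances
    overlap = (2 * radius - dist).clamp(min=0)            # positive where spheres collide
    return overlap.sum()

seq = torch.randn(30, 21, 3) * 0.05
loss = temporal_smoothness(seq) + interpenetration_penalty(seq[0], seq[0] + 0.02)
```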

* 16 pages 

AltFreezing for More General Video Face Forgery Detection

Jul 17, 2023
Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Houqiang Li


Existing face forgery detection models try to discriminate fake images by detecting only spatial artifacts (e.g., generative artifacts, blending) or mainly temporal artifacts (e.g., flickering, discontinuity). They may experience significant performance degradation when facing out-of-domain artifacts. In this paper, we propose to capture both spatial and temporal artifacts in one model for face forgery detection. A simple idea is to leverage a spatiotemporal model (3D ConvNet). However, we find that it may easily rely on one type of artifact and ignore the other. To address this issue, we present a novel training strategy called AltFreezing for more general face forgery detection. AltFreezing aims to encourage the model to detect both spatial and temporal artifacts. It divides the weights of a spatiotemporal network into two groups: spatial-related and temporal-related. The two groups of weights are then alternately frozen during training so that the model learns both spatial and temporal features to distinguish real from fake videos. Furthermore, we introduce various video-level data augmentation methods to improve the generalization capability of the forgery detection model. Extensive experiments show that our framework outperforms existing methods in terms of generalization to unseen manipulations and datasets. Code is available at https://github.com/ZhendongWang6/AltFreezing.
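
A compact sketch of the alternate-freezing idea: partition a 3D ConvNet's weights into temporal-related kernels (temporal extent greater than 1) and spatial-related ones, and freeze one group at a time while training the other. The partition rule and the switching period here are assumptions of this sketch, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def split_spatiotemporal_params(model):
    temporal, spatial = [], []
    for module in model.modules():
        if isinstance(module, nn.Conv3d):
            # kernel_size = (kT, kH, kW): temporal extent > 1 -> temporal-related weights
            group = temporal if module.kernel_size[0] > 1 else spatial
            group.extend(module.parameters())
    return spatial, temporal

def alt_freeze(spatial, temporal, step, period=20):
    freeze_spatial = (step // period) % 2 == 0         # alternate every `period` steps
    for p in spatial:
        p.requires_grad_(not freeze_spatial)
    for p in temporal:
        p.requires_grad_(freeze_spatial)

net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # spatial-related conv
    nn.ReLU(),
    nn.Conv3d(16, 16, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal-related conv
)
spatial, temporal = split_spatiotemporal_params(net)
for step in range(60):
    alt_freeze(spatial, temporal, step)
    # ... forward pass, loss, backward, and an optimizer step that only updates
    #     the parameters whose requires_grad is currently True
```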

* Accepted by CVPR 2023 Highlight; code and models are available at https://github.com/ZhendongWang6/AltFreezing