Hangjie Yuan

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Aug 18, 2023
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the pre-training and fine-tuning time. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
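
The abstract describes ALIF only as "earlier and deeper gated cross-modal fusion with sparsified language encoding layers". As a rough illustration, the PyTorch sketch below shows one plausible form of a gated fusion layer with a zero-initialized gate so fusion opens gradually; the class and parameter names are assumptions for illustration, not RLIPv2's actual implementation.

```python
# Hypothetical sketch of a gated cross-modal fusion layer in the spirit of ALIF.
# Names (GatedCrossModalFusion, gate) are illustrative, not RLIPv2's actual code.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate initialized at zero so fusion starts as an identity mapping
        # and is gradually opened during pre-training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, Nv, D), text_tokens: (B, Nt, D)
        fused, _ = self.cross_attn(query=self.norm(vision_tokens),
                                   key=text_tokens, value=text_tokens)
        return vision_tokens + torch.tanh(self.gate) * fused

# Toy usage
v = torch.randn(2, 100, 256)   # region/query features
t = torch.randn(2, 20, 256)    # encoded relation texts
out = GatedCrossModalFusion()(v, t)
print(out.shape)  # torch.Size([2, 100, 256])
```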

* Accepted to ICCV 2023. Code and models: https://github.com/JacobYuan7/RLIPv2 

ModelScope Text-to-Video Technical Report

Aug 12, 2023
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.
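
For readers unfamiliar with factorized spatio-temporal blocks, the sketch below shows the common pattern of spatial attention within each frame followed by temporal attention across frames. It is a minimal illustrative example, not ModelScopeT2V's actual block design; all names and shapes are assumptions.

```python
# Minimal sketch of a factorized spatio-temporal block: spatial attention over
# tokens in each frame, then temporal attention over frames per spatial location.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, F, N, D) -- batch, frames, spatial tokens, channels
        b, f, n, d = x.shape
        # Spatial attention: mix tokens within each frame.
        xs = x.reshape(b * f, n, d)
        xs = xs + self.spatial_attn(self.norm_s(xs), xs, xs)[0]
        x = xs.reshape(b, f, n, d)
        # Temporal attention: mix the same spatial location across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        xt = xt + self.temporal_attn(self.norm_t(xt), xt, xt)[0]
        return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)

x = torch.randn(1, 8, 64, 320)         # 8 frames, 8x8 latent grid
print(SpatioTemporalBlock()(x).shape)  # torch.Size([1, 8, 64, 320])
```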

* Technical report. Project page: \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary} 

VideoComposer: Compositional Video Synthesis with Motion Controllability

Jun 06, 2023
Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer, which allows users to flexibly compose a video with textual conditions, spatial conditions, and, more importantly, temporal conditions. Specifically, considering the characteristics of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model can make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as a text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models will be publicly available at https://videocomposer.github.io.
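
To make the idea of a unified condition interface concrete, here is a hypothetical sketch of encoding per-frame condition maps (e.g., 2-channel motion vectors) with a shared convolutional stem and mixing them over time with attention. It is an assumption-laden illustration, not the actual STC-encoder; all names and shapes are made up for the example.

```python
# Hypothetical sketch of a unified condition encoder: per-frame condition maps
# (sketches, depth, motion vectors, ...) are embedded by a shared conv stem and
# mixed over time with temporal attention. Not VideoComposer's actual module.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    def __init__(self, in_channels: int, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, dim, kernel_size=4, stride=4),
            nn.SiLU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),
        )
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, F, C, H, W) -- a per-frame condition such as motion vectors
        b, f, c, h, w = cond.shape
        feat = self.stem(cond.reshape(b * f, c, h, w))        # (B*F, D, h', w')
        d, hp, wp = feat.shape[1:]
        tokens = feat.flatten(2).permute(0, 2, 1)             # (B*F, N, D)
        tokens = tokens.reshape(b, f, hp * wp, d)
        # Temporal attention per spatial location to capture motion dynamics.
        seq = tokens.permute(0, 2, 1, 3).reshape(b * hp * wp, f, d)
        seq = seq + self.temporal_attn(seq, seq, seq)[0]
        return seq.reshape(b, hp * wp, f, d).permute(0, 2, 1, 3)  # (B, F, N, D)

mv = torch.randn(1, 8, 2, 256, 256)   # motion vectors: 2-channel (u, v) per frame
print(ConditionEncoder(in_channels=2)(mv).shape)  # torch.Size([1, 8, 1024, 256])
```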

* The first four authors contributed equally. Project page: https://videocomposer.github.io 

Refined Response Distillation for Class-Incremental Player Detection

May 01, 2023
Liang Bai, Hangjie Yuan, Tao Feng, Hong Song, Jian Yang

Detecting players from sports broadcast videos is essential for intelligent event analysis. However, existing methods assume fixed player categories and cannot accommodate real-world scenarios where categories continue to evolve. Directly fine-tuning these methods on newly emerging categories also leads to catastrophic forgetting due to the non-stationary data distribution. Inspired by recent research on incremental object detection (IOD), we propose a Refined Response Distillation (R^2D) method to effectively mitigate catastrophic forgetting in player IOD tasks. Firstly, we design a progressive coarse-to-fine region-dividing scheme that separates high-value and low-value regions from the classification and regression responses for precise and fine-grained regional knowledge distillation. Subsequently, a tailored refined distillation strategy is developed on regions of varying significance to address the performance limitations posed by pronounced feature homogeneity in player IOD tasks. Furthermore, we present the NBA-IOD and Volleyball-IOD datasets as benchmarks and systematically investigate player IOD tasks. Extensive experiments conducted on these benchmarks demonstrate that our method achieves state-of-the-art results. The code and datasets are available at https://github.com/beiyan1911/Players-IOD.
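
As a rough sketch of region-divided response distillation, the snippet below splits teacher responses into high-value and low-value locations by teacher confidence and distills them with different losses and weights. The threshold, weights, and function names are illustrative assumptions, not the exact R^2D formulation.

```python
# Minimal sketch of region-divided response distillation: high-value locations
# get classification + regression distillation, low-value locations a lighter
# classification term. Thresholds and weights are illustrative only.
import torch
import torch.nn.functional as F

def refined_response_distillation(student_cls, teacher_cls,
                                  student_reg, teacher_reg,
                                  conf_thresh: float = 0.3,
                                  low_weight: float = 0.1):
    # *_cls: (N, C) classification logits over N anchor locations
    # *_reg: (N, 4) box regression outputs at the same locations
    teacher_conf = teacher_cls.sigmoid().max(dim=1).values        # (N,)
    high = teacher_conf >= conf_thresh                            # high-value regions
    low = ~high                                                   # low-value regions
    zero = student_cls.sum() * 0                                  # gradient-safe zero

    # High-value regions: distill both classification and localization.
    cls_hi = F.kl_div(F.log_softmax(student_cls[high], dim=1),
                      F.softmax(teacher_cls[high], dim=1),
                      reduction="batchmean") if high.any() else zero
    reg_hi = F.smooth_l1_loss(student_reg[high], teacher_reg[high]) if high.any() else zero

    # Low-value regions: lightly distill classification only.
    cls_lo = F.kl_div(F.log_softmax(student_cls[low], dim=1),
                      F.softmax(teacher_cls[low], dim=1),
                      reduction="batchmean") if low.any() else zero

    return cls_hi + reg_hi + low_weight * cls_lo

loss = refined_response_distillation(torch.randn(100, 20), torch.randn(100, 20),
                                     torch.randn(100, 4), torch.randn(100, 4))
print(loss.item())
```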

* 13 pages, 10 figures 

Progressive Learning without Forgetting

Nov 28, 2022
Tao Feng, Hangjie Yuan, Mang Wang, Ziyuan Huang, Ang Bian, Jianzhou Zhang

Learning from changing tasks and sequential experience without forgetting the obtained knowledge is a challenging problem for artificial neural networks. In this work, we focus on two challenging problems in the paradigm of Continual Learning (CL) without involving any old data: (i) the accumulation of catastrophic forgetting caused by the gradually fading knowledge space from which the model learns previous knowledge; (ii) the uncontrolled tug-of-war dynamics that arise when balancing stability and plasticity while learning new tasks. To tackle these problems, we present Progressive Learning without Forgetting (PLwF) and a credit assignment regime in the optimizer. PLwF densely introduces model functions from previous tasks to construct a knowledge space that contains the most reliable knowledge of each task and the distribution information of different tasks, while credit assignment controls the tug-of-war dynamics by removing gradient conflict through projection. Extensive ablative experiments demonstrate the effectiveness of PLwF and credit assignment. In comparison with other CL methods, we report notably better results even without relying on any raw data.
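
Removing gradient conflict through projection can be illustrated with a standard scheme (in the style of PCGrad/GEM): when the new-task gradient has a negative dot product with an old-task gradient, the conflicting component is projected out. The sketch below shows this generic rule, which may differ from the paper's exact credit-assignment regime.

```python
# Minimal sketch of removing gradient conflict by projection: if the new-task
# gradient conflicts with an old-task gradient (negative dot product), subtract
# its projection onto the old gradient. Generic scheme, not the paper's exact rule.
import torch

def project_conflicting(grad_new: torch.Tensor, grad_old: torch.Tensor) -> torch.Tensor:
    # grad_new, grad_old: flattened gradient vectors over the shared parameters
    dot = torch.dot(grad_new, grad_old)
    if dot < 0:  # conflict: gradients point in opposing directions
        grad_new = grad_new - dot / (grad_old.norm() ** 2 + 1e-12) * grad_old
    return grad_new

g_new = torch.tensor([1.0, -1.0])
g_old = torch.tensor([0.0, 1.0])
print(project_conflicting(g_new, g_old))  # tensor([1., 0.]) -- conflict removed
```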

RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

Sep 05, 2022
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang

The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a synthetic data generation framework, Label Sequence Extension, that expands the scale of language data available within each minibatch; (3) mechanisms to account for ambiguity, Relation Quality Labels and Relation Pseudo-Labels, to mitigate the influence of ambiguous/noisy samples in the pre-training data. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. Code will be available at \url{https://github.com/JacobYuan7/RLIP}.
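
The contrastive alignment between relation-level visual features and relation texts can be sketched generically as an InfoNCE-style loss over region-pair queries and encoded descriptions, as below. This is a simplified stand-in; it does not depict the ParSe architecture, Label Sequence Extension, or the pseudo-labelling mechanisms.

```python
# Minimal sketch of contrastive alignment between relation queries and relation
# text embeddings, as a generic stand-in for RLIP-style relational pre-training.
import torch
import torch.nn.functional as F

def relation_alignment_loss(relation_feats: torch.Tensor,
                            text_feats: torch.Tensor,
                            targets: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    # relation_feats: (Q, D) features of predicted human-object pairs
    # text_feats: (T, D) encoded relation descriptions in the minibatch
    # targets: (Q,) index of the matching text for each query
    rel = F.normalize(relation_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = rel @ txt.t() / temperature        # (Q, T) similarity scores
    return F.cross_entropy(logits, targets)

q, t, d = 16, 32, 256
loss = relation_alignment_loss(torch.randn(q, d), torch.randn(t, d),
                               torch.randint(0, t, (q,)))
print(loss.item())
```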

* Tech report 

Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation

Apr 05, 2022
Tao Feng, Mang Wang, Hangjie Yuan

Traditional object detectors are ill-equipped for incremental learning, and directly fine-tuning a well-trained detection model on only new data leads to catastrophic forgetting. Knowledge distillation is a flexible way to mitigate catastrophic forgetting. In Incremental Object Detection (IOD), previous work mainly focuses on distilling a combination of features and responses, but under-explores the information contained in the responses. In this paper, we propose a response-based incremental distillation method, dubbed Elastic Response Distillation (ERD), which focuses on elastically learning responses from the classification head and the regression head. Firstly, our method transfers category knowledge while equipping the student detector with the ability to retain localization information during incremental learning. In addition, we evaluate the quality of all locations and select valuable responses via the Elastic Response Selection (ERS) strategy. Finally, we elucidate that the knowledge from different responses should be assigned different importance during incremental distillation. Extensive experiments conducted on MS COCO demonstrate that our method achieves state-of-the-art results, substantially narrowing the performance gap towards full training.
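
A minimal sketch of selecting valuable responses before distillation, in the spirit of ERS, might rank locations by teacher confidence and distill only the top fraction. The ranking criterion and keep ratio below are assumptions, not ERD's exact selection rule.

```python
# Hypothetical sketch of response selection before distillation: keep the
# locations whose teacher responses look most informative (here, ranked by
# teacher classification confidence) and distill on those only.
import torch
import torch.nn.functional as F

def select_and_distill(student_cls, teacher_cls, keep_ratio: float = 0.1):
    # *_cls: (N, C) per-location classification logits
    quality = teacher_cls.sigmoid().max(dim=1).values     # teacher confidence per location
    k = max(1, int(keep_ratio * quality.numel()))
    idx = quality.topk(k).indices                         # selected high-quality responses
    return F.kl_div(F.log_softmax(student_cls[idx], dim=1),
                    F.softmax(teacher_cls[idx], dim=1),
                    reduction="batchmean")

loss = select_and_distill(torch.randn(1000, 80), torch.randn(1000, 80))
print(loss.item())
```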

* Accepted by CVPR 2022. arXiv admin note: substantial text overlap with arXiv:2110.13471 

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Feb 01, 2022
Hangjie Yuan, Mang Wang, Dong Ni, Liangpeng Xu

Human-Object Interaction (HOI) detection is an essential task for understanding human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses a key merit of two-stage methods: the object-guided hierarchy. The object in an HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to benefit from this object-guided hierarchy. A Similarity KL (SKL) loss is proposed to optimize the VSM to align with the HOI dataset's priors. To overcome the static semantic embedding problem, we propose to generate cross-modality-aware visual and semantic features via Cross-Modal Calibration (CMC). Combined, the above modules compose the Object-guided Cross-modal Calibration Network (OCN). Experiments conducted on two popular HOI detection benchmarks demonstrate the significance of incorporating statistical prior knowledge and produce state-of-the-art performance. Further analysis indicates that the proposed modules serve as a stronger verb predictor and a superior way of utilizing prior knowledge. The code is available at \url{https://github.com/JacobYuan7/OCN-HOI-Benchmark}.
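
One simple way to picture an object-guided statistical prior is a co-occurrence table P(verb | object) folded into the verb prediction with a KL-style alignment term, as in the hypothetical sketch below. The prior table, loss form, and weights are illustrative assumptions, not OCN's actual VSM, SKL loss, or CMC modules.

```python
# Hypothetical sketch of an object-guided verb prior: the detected object class
# indexes a co-occurrence prior P(verb | object), which softly constrains the
# verb prediction via a KL term alongside the usual supervised loss.
import torch
import torch.nn.functional as F

num_objects, num_verbs = 80, 117

# Co-occurrence prior P(verb | object), e.g. estimated from training annotations.
prior = F.softmax(torch.randn(num_objects, num_verbs), dim=1)

def object_guided_verb_loss(verb_logits, object_ids, verb_targets, prior_weight=0.5):
    # verb_logits: (B, V) raw verb predictions; object_ids: (B,) detected object class
    verb_prob = F.softmax(verb_logits, dim=1)
    # KL term pulling predictions toward the object-conditioned prior.
    kl_to_prior = F.kl_div(verb_prob.log(), prior[object_ids], reduction="batchmean")
    # Standard supervised term on the annotated verbs (multi-label).
    bce = F.binary_cross_entropy_with_logits(verb_logits, verb_targets)
    return bce + prior_weight * kl_to_prior

loss = object_guided_verb_loss(torch.randn(8, num_verbs),
                               torch.randint(0, num_objects, (8,)),
                               torch.randint(0, 2, (8, num_verbs)).float())
print(loss.item())
```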

* Accepted to AAAI2022 

Spatio-Temporal Dynamic Inference Network for Group Activity Recognition

Aug 26, 2021
Hangjie Yuan, Dong Ni, Mang Wang

Group activity recognition aims to understand the activity performed by a group of people, and modeling complex spatio-temporal interactions is key to solving it. Previous methods are limited to reasoning on a predefined graph, which ignores the inherent person-specific interaction context. Moreover, they adopt inference schemes that are computationally expensive and easily lead to the over-smoothing problem. In this paper, we achieve person-specific spatio-temporal inference by proposing the Dynamic Inference Network (DIN), which is composed of a Dynamic Relation (DR) module and a Dynamic Walk (DW) module. We first initialize interaction fields on a primary spatio-temporal graph. Within each interaction field, we apply DR to predict the relation matrix and DW to predict the dynamic walk offsets in a joint-processing manner, thus forming a person-specific interaction graph. By updating features on this person-specific graph, each person can possess a global-level interaction field from a local initialization. Experiments indicate the effectiveness of both modules. Moreover, DIN achieves significant improvements over previous state-of-the-art methods on two popular datasets under the same setting, while incurring much lower computational overhead in the reasoning module.
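
A minimal sketch of person-specific relation inference, loosely in the spirit of the DR module, predicts a relation matrix from person features and uses it to update each person's representation. The dynamic walk offsets (DW) are omitted, and all names below are illustrative assumptions rather than DIN's actual code.

```python
# Minimal sketch of dynamic, person-specific relation inference: relation
# weights between people are predicted from their features per sample and used
# to aggregate and update each person's representation.
import torch
import torch.nn as nn

class DynamicRelation(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (B, N, D) features of N people in the scene
        q, k = self.query(person_feats), self.key(person_feats)
        # Person-specific relation matrix, predicted dynamically per sample.
        relation = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        # Update each person's feature by aggregating over its predicted relations.
        return person_feats + self.update(relation @ person_feats)

feats = torch.randn(2, 12, 256)        # e.g., 12 players per clip
print(DynamicRelation()(feats).shape)  # torch.Size([2, 12, 256])
```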

* Accepted to ICCV2021 