
Junqi Liu


Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Aug 12, 2022
Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang

Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial to various downstream video understanding tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL), which injects the motion information captured by optical flows into RGB frames to strengthen feature learning. To achieve this, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities. Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples. Extensive experiments on standard benchmarks validate the effectiveness of the proposed method. With the commonly used 3D ResNet-18 as the backbone, we achieve top-1 accuracies of 91.5% on UCF101 and 50.3% on Something-Something v2 for video classification, and a 65.6% top-1 recall on UCF101 for video retrieval, notably improving on the state of the art.

* Accepted by ECCV 2022, 17 pages
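
The abstract's frame-level, cross-modal objective can be pictured with a short sketch. Below is a hypothetical PyTorch rendering of a contrastive loss in the spirit of LMCL: time-aligned RGB and optical-flow frame embeddings form positive pairs, with all other frames in the batch acting as negatives. The function name, tensor shapes, and temperature are illustrative assumptions, not the authors' reference implementation.

# Hypothetical sketch of a frame-level cross-modal InfoNCE objective,
# assuming time-aligned RGB and flow frame embeddings.
import torch
import torch.nn.functional as F

def local_motion_contrastive_loss(rgb_feats, flow_feats, temperature=0.07):
    """rgb_feats, flow_feats: (B*T, D) time-aligned frame-level embeddings."""
    rgb = F.normalize(rgb_feats, dim=-1)
    flow = F.normalize(flow_feats, dim=-1)
    logits = rgb @ flow.t() / temperature              # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: RGB -> flow and flow -> RGB.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

Motion-shuffled negatives from FRA or samples screened by MDS would simply enter this loss as additional rows of the batch under such a formulation.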

End-to-End Human Object Interaction Detection with HOI Transformer

Mar 08, 2021
Cheng Zou, Bohan Wang, Yue Hu, Junqi Liu, Qian Wu, Yu Zhao, Boxun Li, Chenguang Zhang, Chi Zhang, Yichen Wei, Jian Sun

We propose HOI Transformer to tackle human-object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple the HOI task into separate stages of object detection and interaction classification or introduce a surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to enforce HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves $26.61\%$ $AP$ on HICO-DET and $52.9\%$ $AP_{role}$ on V-COCO, surpassing previous methods while being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer .

* Accepted to CVPR 2021
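
The set-prediction step behind such an end-to-end detector can be sketched briefly. Below is a hypothetical matching routine in the spirit of the quintuple matching described above: each query predicts a human box, an object box, and verb scores, and Hungarian matching assigns predictions to ground-truth HOI instances before the loss is computed. The field names, cost weights, and cost terms are assumptions for illustration, not the paper's exact formulation.

# Hypothetical bipartite matching of predicted HOI quintuples to ground truth.
import torch
from scipy.optimize import linear_sum_assignment

def match_hoi_quintuples(pred, gt, w_cls=1.0, w_box=1.0):
    """pred: dict with 'h_boxes', 'o_boxes' (N, 4) and 'verb_logits' (N, V);
    gt: dict with 'h_boxes', 'o_boxes' (M, 4) and 'verb_labels' (M,)."""
    cls_cost = -pred['verb_logits'].softmax(-1)[:, gt['verb_labels']]     # (N, M)
    box_cost = (torch.cdist(pred['h_boxes'], gt['h_boxes'], p=1) +
                torch.cdist(pred['o_boxes'], gt['o_boxes'], p=1))         # (N, M)
    cost = w_cls * cls_cost + w_box * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx    # matched prediction / ground-truth indices

Unmatched queries would be supervised toward a "no interaction" class under this scheme, mirroring DETR-style set prediction.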

Detailed 2D-3D Joint Representation for Human-Object Interaction

May 21, 2020
Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, Cewu Lu

Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also commonly utilized in HOI learning because of its view-independence. However, coarse 3D body joints carry only sparse body information and are not sufficient for understanding complex interactions, so detailed 3D body shape is needed to go further. Meanwhile, the interacting object in 3D is also not fully studied in HOI learning. In light of these, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face, and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate the 2D ambiguity processing capacity of models, we propose a new benchmark named Ambiguous-HOI consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI show the effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.

* Accepted to CVPR 2020, supplementary materials included, code available: https://github.com/DirtyHarryLYL/DJ-RN
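
The joint-learning idea from the abstract can be illustrated with a minimal sketch: 2D appearance features and 3D body/object features are projected into a shared space, concatenated for HOI classification, and a consistency term pulls the two modalities together. The module sizes, class count, and the choice of a cosine consistency loss are assumptions for illustration, not the released DJ-RN code.

# Hypothetical 2D-3D joint representation with a cross-modal consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Joint2D3DHOI(nn.Module):
    def __init__(self, dim_2d=1024, dim_3d=512, dim_joint=256, num_hoi=600):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_joint)
        self.proj_3d = nn.Linear(dim_3d, dim_joint)
        self.classifier = nn.Linear(2 * dim_joint, num_hoi)

    def forward(self, feat_2d, feat_3d):
        z2d = self.proj_2d(feat_2d)
        z3d = self.proj_3d(feat_3d)
        logits = self.classifier(torch.cat([z2d, z3d], dim=-1))
        # Cross-modal consistency: encourage agreement between the projected modalities.
        consistency = 1.0 - F.cosine_similarity(z2d, z3d, dim=-1).mean()
        return logits, consistency

In such a setup the classification loss and the consistency term would be summed with a weighting factor during training.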