Houqiang Li

Masked Motion Predictors are Strong 3D Action Representation Learners

Aug 14, 2023

Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang, Houqiang Li

Abstract: In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task to perform masked self-component reconstruction in human joints, explicit contextual motion modeling is key to the success of learning effective feature representation for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an empirical semantic richness prior that guides the masking process, promoting better attention to semantically rich temporal regions. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of our MAMP is available at https://github.com/maoyunyao/MAMP.

* To appear in ICCV 2023
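
The two ideas in the MAMP abstract (predicting the temporal motion of masked joints, and letting motion guide which joints are masked) can be illustrated with a small sketch. This is not the authors' code: the function names are hypothetical, motion is assumed to be a plain frame-to-frame difference, and the masking rule is simplified to a deterministic top-k by motion magnitude.

```python
def temporal_motion(seq):
    """Frame-to-frame displacement of every joint.

    seq[t][j] is an (x, y, z) tuple for joint j at frame t.
    Returns a list with len(seq) - 1 frames of displacement tuples.
    """
    motion = []
    for t in range(len(seq) - 1):
        frame = []
        for j in range(len(seq[t])):
            before, after = seq[t][j], seq[t + 1][j]
            frame.append(tuple(b - a for a, b in zip(before, after)))
        motion.append(frame)
    return motion


def motion_guided_mask(motion, mask_ratio):
    """Pick the (frame, joint) positions with the largest motion to mask.

    Assumption: a deterministic top-k on the L1 motion magnitude stands in
    for whatever sampling rule the paper actually uses.
    """
    scores = {
        (t, j): sum(abs(c) for c in vec)
        for t, frame in enumerate(motion)
        for j, vec in enumerate(frame)
    }
    k = int(len(scores) * mask_ratio)
    # Sort by descending magnitude; ties are broken by position for determinism.
    ranked = sorted(scores, key=lambda pos: (-scores[pos], pos))
    return set(ranked[:k])


# Joint 0 stays still, joint 1 accelerates along x.
sequence = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
motion = temporal_motion(sequence)
masked = motion_guided_mask(motion, mask_ratio=0.5)
```

In this toy sequence only the moving joint is selected for masking, and the displacement tuples at the masked positions would serve as the prediction targets.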

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

Aug 11, 2023

Yufei Yin, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li

Abstract: Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of the weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals among their neighboring ones, thus benefiting the subsequent pseudo labeling. Extensive experiments on the prevalent PASCAL VOC 2007 & 2012 and COCO datasets demonstrate the superior performance of our CBL framework. Code will be available at https://github.com/Yinyf0804/WSOD-CBL/.

* Accepted by ICCV 2023
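
The abstract above does not spell out the weighted exponential moving average used to build the teacher. A minimal sketch of one plausible form follows, assuming each refinement module is a flat list of parameters, that the modules are first combined by a normalized weighted average, and that the result is blended into the teacher with a fixed momentum. The function name, weights, and momentum are illustrative assumptions, not values from the paper.

```python
def weighted_ema_update(teacher, students, weights, momentum=0.9):
    """Blend a weighted ensemble of student parameters into the teacher.

    teacher  : list of floats (current teacher parameters)
    students : list of parameter lists, one per refinement module
    weights  : one non-negative weight per student (assumed, not from the paper)
    """
    total = sum(weights)
    updated = []
    for i, old in enumerate(teacher):
        # Weighted average of the i-th parameter across all students.
        ensemble = sum(w * s[i] for w, s in zip(weights, students)) / total
        # Standard EMA step towards the ensemble value.
        updated.append(momentum * old + (1.0 - momentum) * ensemble)
    return updated


teacher = [0.0, 0.0]
students = [[1.0, 2.0], [3.0, 4.0]]
teacher = weighted_ema_update(teacher, students, weights=[1.0, 1.0], momentum=0.5)
```

A higher momentum makes the teacher change more slowly, which is the usual reason an averaged teacher is treated as more reliable than any single student.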

Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video

Aug 08, 2023

Weichao Zhao, Hezhen Hu, Wengang Zhou, Li Li, Houqiang Li

Abstract: Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling their physically plausible relation, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly exploiting spatial-temporal information to achieve better interacting hand reconstruction. On one hand, we leverage temporal context to complement insufficient information provided by the single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we further propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments are performed to validate the effectiveness of our proposed framework, which achieves new state-of-the-art performance on public benchmarks.

* 16 pages

AltFreezing for More General Video Face Forgery Detection

Jul 17, 2023

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Houqiang Li

Abstract: Existing face forgery detection models try to discriminate fake images by detecting only spatial artifacts (e.g., generative artifacts, blending) or mainly temporal artifacts (e.g., flickering, discontinuity). They may experience significant performance degradation when facing out-domain artifacts. In this paper, we propose to capture both spatial and temporal artifacts in one model for face forgery detection. A simple idea is to leverage a spatiotemporal model (3D ConvNet). However, we find that it may easily rely on one type of artifact and ignore the other. To address this issue, we present a novel training strategy called AltFreezing for more general face forgery detection. The AltFreezing aims to encourage the model to detect both spatial and temporal artifacts. It divides the weights of a spatiotemporal network into two groups: spatial-related and temporal-related. Then the two groups of weights are alternately frozen during the training process so that the model can learn spatial and temporal features to distinguish real or fake videos. Furthermore, we introduce various video-level data augmentation methods to improve the generalization capability of the forgery detection model. Extensive experiments show that our framework outperforms existing methods in terms of generalization to unseen manipulations and datasets. Code is available at https://github.com/ZhendongWang6/AltFreezing.

* Accepted by CVPR 2023 Highlight, code and models are available at https://github.com/ZhendongWang6/AltFreezing
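
The alternating schedule described in the AltFreezing abstract can be sketched without any deep learning framework. The example below assumes every parameter has already been tagged as "spatial" or "temporal" and that the frozen group switches after a fixed number of steps; the parameter names, the tagging, and the step counts are hypothetical and not taken from the released code.

```python
def frozen_group(step, freeze_spatial_steps=1, freeze_temporal_steps=1):
    """Return which weight group is frozen at a given training step."""
    cycle = freeze_spatial_steps + freeze_temporal_steps
    return "spatial" if step % cycle < freeze_spatial_steps else "temporal"


def trainable_flags(param_groups, step, **schedule):
    """Map each parameter name to True (trainable) or False (frozen).

    param_groups maps a parameter name to "spatial" or "temporal".
    """
    frozen = frozen_group(step, **schedule)
    return {name: group != frozen for name, group in param_groups.items()}


# Hypothetical parameter names for a single factorized 3D convolution.
params = {
    "block1.spatial_kernel": "spatial",
    "block1.temporal_kernel": "temporal",
}
flags_step0 = trainable_flags(params, step=0)
flags_step1 = trainable_flags(params, step=1)
```

In a framework such as PyTorch, these flags would typically be written to each parameter's `requires_grad` attribute before the optimizer step.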

LVVC: A Learned Versatile Video Coding Framework for Efficient Human-Machine Vision

Jun 19, 2023

Xihua Sheng, Li Li, Dong Liu, Houqiang Li

Abstract: Almost all digital videos are coded into compact representations before being transmitted. Such compact representations need to be decoded back to pixels before being displayed to humans and, as usual, before being processed or analyzed by machine vision algorithms. For machine vision, it is more efficient, at least conceptually, to process or analyze the coded representations directly without decoding them into pixels. Motivated by this concept, we propose a learned versatile video coding (LVVC) framework, which targets learning compact representations that support both decoding and direct processing or analysis, thereby being versatile for both human and machine vision. Our LVVC framework has a feature-based compression loop, where one frame is encoded (resp. decoded) to intermediate features, and the intermediate features are referenced for encoding (resp. decoding) the following frames. Our proposed feature-based compression loop has two key technologies: feature-based temporal context mining and a cross-domain motion encoder/decoder. With the LVVC framework, the intermediate features may be used to reconstruct videos, or be fed into different task networks. The LVVC framework is implemented and evaluated with video reconstruction, video processing, and video analysis tasks on the well-established benchmark datasets. The evaluation results demonstrate the compression efficiency of the proposed LVVC framework.

Exploring Effective Mask Sampling Modeling for Neural Image Compression

Jun 09, 2023

Lin Liu, Mingming Zhao, Shanxin Yuan, Wenlong Lyu, Wengang Zhou, Houqiang Li, Yanfeng Wang, Qi Tian

Abstract: Image compression aims to reduce the information redundancy in images. Most existing neural image compression methods rely on side information from hyperprior or context models to eliminate spatial redundancy, but rarely address the channel redundancy. Inspired by the mask sampling modeling in recent self-supervised learning methods for natural language processing and high-level vision, we propose a novel pretraining strategy for neural image compression. Specifically, a Cube Mask Sampling Module (CMSM) is proposed to apply both spatial and channel mask sampling modeling to image compression in the pre-training stage. Moreover, to further reduce channel redundancy, we propose the Learnable Channel Mask Module (LCMM) and the Learnable Channel Completion Module (LCCM). Our plug-and-play CMSM, LCMM, and LCCM modules can apply to both CNN-based and Transformer-based architectures, significantly reduce the computational cost, and improve the quality of images. Experiments on the public Kodak and Tecnick datasets demonstrate that our method achieves competitive performance with lower computational complexity compared to state-of-the-art image compression methods.

* 10 pages

MA2CL: Masked Attentive Contrastive Learning for Multi-Agent Reinforcement Learning

Jun 03, 2023

Haolin Song, Mingxiao Feng, Wengang Zhou, Houqiang Li

Abstract: Recent approaches have utilized self-supervised auxiliary tasks as representation learning to improve the performance and sample efficiency of vision-based reinforcement learning algorithms in single-agent settings. However, in multi-agent reinforcement learning (MARL), these techniques face challenges because each agent only receives partial observation from an environment influenced by others, resulting in correlated observations in the agent dimension. So it is necessary to consider agent-level information in representation learning for MARL. In this paper, we propose an effective framework called Multi-Agent Masked Attentive Contrastive Learning (MA2CL), which encourages learning representation to be both temporal and agent-level predictive by reconstructing the masked agent observation in latent space. Specifically, we use an attention reconstruction model for recovering and the model is trained via contrastive learning. MA2CL allows better utilization of contextual information at the agent level, facilitating the training of MARL agents for cooperation tasks. Extensive experiments demonstrate that our method significantly improves the performance and sample efficiency of different MARL algorithms and outperforms other methods in various vision-based and state-based scenarios. Our code can be found at https://github.com/ustchlsong/MA2CL.
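
The abstract above says the reconstruction model is trained via contrastive learning but does not name the loss. As an illustration only, the sketch below uses the common InfoNCE form, which is an assumption: the reconstructed latent of the masked agent acts as the query, the true latent is the positive key, and latents of other agents act as negatives. All names and the temperature are hypothetical.

```python
import math


def info_nce(query, keys, positive_index, temperature=0.1):
    """InfoNCE loss for one query against a list of candidate keys.

    Similarity is a plain dot product; the loss is the negative log
    probability that the softmax over similarities picks the positive key.
    """
    logits = [
        sum(q * k for q, k in zip(query, key)) / temperature for key in keys
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


reconstructed = [1.0, 0.0]               # query: reconstructed masked-agent latent
candidates = [[1.0, 0.0], [0.0, 1.0]]    # key 0 is the true latent, key 1 a negative
loss_match = info_nce(reconstructed, candidates, positive_index=0, temperature=1.0)
loss_mismatch = info_nce(reconstructed, candidates, positive_index=1, temperature=1.0)
```

The loss is smaller when the reconstruction is closer to the true latent than to the negatives, which is the behavior a contrastive reconstruction objective relies on.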

Detect Any Shadow: Segment Anything for Video Shadow Detection

May 26, 2023

Yonghui Wang, Wengang Zhou, Yunyao Mao, Houqiang Li

Abstract: The Segment Anything Model (SAM) has achieved great success in the field of natural image segmentation. Nevertheless, SAM tends to classify shadows as background, resulting in poor segmentation performance on the shadow detection task. In this paper, we propose a simple but effective approach for fine-tuning SAM to detect shadows. We also combine it with a long short-term attention mechanism to extend its capabilities to video shadow detection. Specifically, we first fine-tune SAM by utilizing shadow data combined with sparse prompts and apply the fine-tuned model to detect a specific frame (e.g., the first frame) in the video with a little user assistance. Subsequently, using the detected frame as a reference, we employ a long short-term network to learn spatial correlations between distant frames and temporal consistency between contiguous frames, thereby achieving shadow information propagation across frames. Extensive experimental results demonstrate that our method outperforms the state-of-the-art techniques, with improvements of 17.2% and 3.3% in terms of MAE and IoU, respectively, validating the effectiveness of our method.
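
For readers unfamiliar with the two metrics quoted in the abstract above, here is a small sketch using their standard definitions on flattened masks: MAE is the mean absolute difference between predicted and ground-truth pixel values (lower is better), and IoU is the overlap of the binarized masks divided by their union (higher is better). The 0.5 threshold and the function names are assumptions; the paper's exact evaluation protocol may differ.

```python
def mean_absolute_error(pred, target):
    """Average absolute per-pixel difference between two flattened masks."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def intersection_over_union(pred, target, threshold=0.5):
    """IoU of two flattened masks after binarizing at the given threshold."""
    pred_on = [p >= threshold for p in pred]
    target_on = [t >= threshold for t in target]
    intersection = sum(1 for p, t in zip(pred_on, target_on) if p and t)
    union = sum(1 for p, t in zip(pred_on, target_on) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.
    return intersection / union if union else 1.0


predicted = [1.0, 1.0, 0.0, 0.0]   # predicted shadow mask, flattened
truth = [1.0, 0.0, 0.0, 0.0]       # ground-truth shadow mask, flattened
mae = mean_absolute_error(predicted, truth)
iou = intersection_over_union(predicted, truth)
```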

Hybrid and Collaborative Passage Reranking

May 16, 2023

Zongmeng Zhang, Wengang Zhou, Jiaxin Shi, Houqiang Li

Abstract: In a passage retrieval system, the initial passage retrieval results may be unsatisfactory and can be refined by a reranking scheme. Existing solutions to passage reranking focus on enriching the interaction between the query and each passage separately, neglecting the context among the top-ranked passages in the initial retrieval list. To tackle this problem, we propose a Hybrid and Collaborative Passage Reranking (HybRank) method, which leverages the substantial similarity measurements of upstream retrievers for passage collaboration and incorporates the lexical and semantic properties of sparse and dense retrievers for reranking. Besides, built on off-the-shelf retriever features, HybRank is a plug-in reranker capable of enhancing arbitrary passage lists including previously reranked ones. Extensive experiments demonstrate the stable improvements of performance over prevalent retrieval and reranking methods, and verify the effectiveness of the core components of HybRank.

* Accepted to Findings of ACL 2023

SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding

May 08, 2023

Hezhen Hu, Weichao Zhao, Wengang Zhou, Houqiang Li

Abstract: Hand gestures play a crucial role in the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources and suffer from limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework that incorporates a model-aware hand prior. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resources, we first perform self-supervised learning to model their statistics. To this end, we design multi-level masked modeling strategies (joint, frame and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate the model-aware hand prior to better capture hierarchical context over the sequence. After the pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR), and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain.

* Accepted to TPAMI. Project Page: https://signbert-zoo.github.io/
