Linchao Zhu

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

Aug 03, 2022

Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian (+1 more)

Abstract: Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. Existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to the emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) the emotions have extremely varied temporal dynamics; 2) the emotion cues are embedded in both appearances and complex plots; 3) the fine-grained temporal annotations are complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves complex plot understanding by reasoning about the dependency between the multi-granularity temporal contexts from the coarse stream, and adaptively integrates them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.

* Accepted by ACM Multimedia 2022

PoseGU: 3D Human Pose Estimation with Novel Human Pose Generator and Unbiased Learning

Jul 07, 2022

Shannan Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang

Abstract: 3D pose estimation has recently gained substantial interest in the computer vision domain. Existing 3D pose estimation methods rely heavily on large, well-annotated 3D pose datasets, and they generalize poorly to unseen poses because of the limited diversity of 3D poses in training sets. In this work, we propose PoseGU, a novel human pose generator that generates diverse poses with access to only a small set of seed samples, while employing Counterfactual Risk Minimization to pursue an unbiased evaluation objective. Extensive experiments demonstrate that PoseGU outperforms almost all the state-of-the-art 3D human pose methods under consideration on three popular benchmark datasets. Empirical analysis also shows that PoseGU generates 3D poses with improved data diversity and better generalization ability.

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

May 02, 2022

Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

Abstract: Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% in the best case. The code is available at https://github.com/mzhaoshuai/CenterCLIP.

* accepted by SIGIR 2022, code is at https://github.com/mzhaoshuai/CenterCLIP
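
The segment-level clustering described in the CenterCLIP abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain alternating k-medoids with a deterministic "first k points" initialisation, treats each token as a short list of floats, and the names `sq_dist`, `k_medoids`, `cluster_segments`, `segment_len` and `k` are all hypothetical. It only shows the general idea of splitting the token sequence into segments, keeping the medoid (center) tokens of each segment, and concatenating them in their original order.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_medoids(points, k, iters=10):
    """Return the sorted indices of k medoids among `points`.

    Assumption: simple alternating k-medoids, initialised with the
    first k points so that the result is deterministic.
    """
    medoids = list(range(min(k, len(points))))
    for _ in range(iters):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: sq_dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can happen with duplicate points
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(sq_dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


def cluster_segments(tokens, segment_len, k):
    """Keep k center tokens per segment, preserving temporal order."""
    centers = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        for idx in k_medoids(segment, k):
            centers.append(segment[idx])
    return centers


# Toy example: 8 two-dimensional tokens, 2 segments of 4 tokens each.
tokens = [
    [0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
    [1.0, 1.0], [1.0, 1.1], [9.0, 9.0], [9.0, 9.1],
]
centers = cluster_segments(tokens, segment_len=4, k=2)
```

In this toy input each segment contains two tight pairs of near-duplicate tokens, so the sketch keeps one token per pair and halves the sequence length from 8 to 4.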

Unified Transformer Tracker for Object Tracking

Mar 29, 2022

Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan

Abstract: As an important area in computer vision, object tracking has formed two separate communities that respectively study Single Object Tracking (SOT) and Multiple Object Tracking (MOT). However, current methods in one tracking scenario are not easily adapted to the other due to the divergent training datasets and tracking objects of both tasks. Although UniTrack (Wang et al., 2021) demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit the large-scale tracking datasets for training and performs poorly on single object tracking. In this work, we present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm. A track transformer is developed in our UTT to track the target in both SOT and MOT. The correlation between the target and tracking frame features is exploited to localize the target. We demonstrate that both SOT and MOT tasks can be solved within this framework. The model can be trained end-to-end on both tasks simultaneously by alternately optimizing the SOT and MOT objectives on the datasets of individual tasks. Extensive experiments are conducted on several benchmarks with a unified model trained on SOT and MOT datasets. Code will be available at https://github.com/Flowerfan/Trackron.

* CVPR 2022

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Mar 28, 2022

Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

Abstract: Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current temporal grounding datasets do not specifically test for compositional generalizability. To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the state-of-the-art methods on our new dataset splits, we empirically find that they fail to generalize to queries with novel combinations of seen words. To tackle this challenge, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies and learns fine-grained semantic correspondence among them. Experiments illustrate the superior compositional generalizability of our approach. The repository of this work is at https://github.com/YYJMJC/Compositional-Temporal-Grounding.

* Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022

Vector-Decomposed Disentanglement for Domain-Invariant Object Detection

Aug 15, 2021

Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, Yi Yang

Abstract: To improve the generalization of detectors in domain adaptive object detection (DAOD), recent advances mainly explore aligning feature-level distributions between the source and a single target domain, which may neglect the impact of domain-specific information existing in the aligned features. For DAOD, it is important to extract domain-invariant object representations. To this end, in this paper, we try to disentangle domain-invariant representations from domain-specific representations, and we propose a novel disentanglement method based on vector decomposition. First, an extractor is devised to separate domain-invariant representations from the input, which are used for extracting object proposals. Second, domain-specific representations are introduced as the differences between the input and the domain-invariant representations. Through the difference operation, the gap between the domain-specific and domain-invariant representations is enlarged, which encourages the domain-invariant representations to contain more domain-irrelevant information. In the experiments, we separately evaluate our method on the single- and compound-target cases. For the single-target case, experimental results on four domain-shift scenes show that our method obtains a significant performance gain over baseline methods. Moreover, for the compound-target case (i.e., the target is a compound of two different domains without domain labels), our method outperforms baseline methods by around 4%, which demonstrates the effectiveness of our method.

* Accepted by ICCV 2021
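
The difference operation in the Vector-Decomposed Disentanglement abstract can be sketched in a few lines. This is only an illustration under stated assumptions: in the paper the extractor is a learned module, whereas `toy_extractor` below is a hypothetical stand-in that simply scales the input, and features are plain lists of floats rather than feature maps. The function name `decompose` is also an assumption.

```python
def toy_extractor(features):
    """Hypothetical stand-in for a learned domain-invariant extractor."""
    return [0.5 * f for f in features]


def decompose(features, extractor):
    """Split features into a domain-invariant and a domain-specific part.

    The domain-specific part is defined as the difference between the
    input and the domain-invariant part, so the two always sum back to
    the original features.
    """
    invariant = extractor(features)
    specific = [f - i for f, i in zip(features, invariant)]
    return invariant, specific


features = [2.0, 4.0, -1.0]
invariant, specific = decompose(features, toy_extractor)
# By construction, invariant[i] + specific[i] reproduces features[i].
reconstructed = [i + s for i, s in zip(invariant, specific)]
```

Defining the domain-specific part as a residual means no second extractor is needed, and whatever the invariant branch does not capture is pushed into the specific branch.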

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference

Aug 09, 2021

Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang

Abstract: Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. This new task requires a model to infer whether a natural language statement entails or contradicts a given video clip. In this paper, we study how to address three critical challenges for this task: judging the global correctness of a statement involving multiple semantic meanings, joint reasoning over video and subtitles, and modeling long-range relationships and complex social interactions. First, we propose an adaptive hierarchical graph network that achieves in-depth understanding of the video over complex interactions. Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structures of the statement. Second, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network across the three hierarchies. The semantic coherence learning can further improve the alignment between vision and linguistics, and the coherence across a sequence of video segments. Experimental results show that our method outperforms the baseline by a large margin.

Less is More: Sparse Sampling for Dense Reaction Predictions

Jun 03, 2021

Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, Yi Yang

Abstract: Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience. In this report, we present our method for the 2021 Evoked Expression from Videos Challenge. In particular, our model utilizes both audio and image modalities as inputs to predict emotion changes of viewers. To model long-range emotion changes, we use a GRU-based model to predict one sparse signal at 1 Hz. We observe that the emotion changes are smooth. Therefore, the final dense prediction is obtained by linearly interpolating the signal, which is robust to prediction fluctuations. Albeit simple, the proposed method has achieved a Pearson's correlation score of 0.04430 on the final private test set.

* Code is available at: https://github.com/HenryLittle/EEV-Challenge-2021
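
The sparse-to-dense step described in the abstract above (predict at 1 Hz, then linearly interpolate) can be sketched as follows. This is a minimal illustration and not the code from the linked repository: the function name `interpolate_dense`, its parameters, and the 4 Hz target rate in the example are assumptions chosen for readability, and the sparse values are made up rather than model outputs.

```python
def interpolate_dense(sparse, sparse_hz=1.0, dense_hz=4.0):
    """Linearly interpolate a signal sampled at sparse_hz up to dense_hz.

    The output covers the same time span as the input, from the first
    sparse sample to the last one.
    """
    if not sparse:
        return []
    ratio = dense_hz / sparse_hz
    n_dense = int(round((len(sparse) - 1) * ratio)) + 1
    dense = []
    for i in range(n_dense):
        pos = i / ratio                       # position in sparse-sample units
        lo = min(int(pos), len(sparse) - 1)   # left neighbour
        hi = min(lo + 1, len(sparse) - 1)     # right neighbour (clamped)
        frac = pos - lo
        dense.append((1 - frac) * sparse[lo] + frac * sparse[hi])
    return dense


# Three hypothetical 1 Hz predictions, upsampled to 4 Hz (9 values).
sparse_scores = [0.0, 1.0, 0.5]
dense_scores = interpolate_dense(sparse_scores, sparse_hz=1.0, dense_hz=4.0)
```

Because every dense value is a weighted average of its two neighbouring sparse predictions, a single noisy sparse value affects only the samples around it, which matches the robustness argument in the abstract.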

OR-Net: Pointwise Relational Inference for Data Completion under Partial Observation

May 05, 2021

Qianyu Feng, Linchao Zhu, Bang Zhang, Pan Pan, Yi Yang

Abstract: Contemporary data-driven methods are typically trained with full supervision on large-scale datasets, which limits their applicability. However, in real systems with limitations such as measurement error and data acquisition problems, people usually obtain incomplete data. Although data completion has attracted wide attention, the underlying data pattern and relativity remain under-explored. Currently, the family of latent variable models allows learning deep latent variables over observed variables by fitting the marginal distribution. As far as we know, current methods fail to perceive the data relativity under partial observation. Aiming at modeling incomplete data, this work uses relational inference to fill in the incomplete data. Specifically, we aim to approximate the real joint distribution over the partial observation and latent variables, and thus infer the unseen targets. To this end, we propose the Omni-Relational Network (OR-Net) to model the pointwise relativity in two aspects: (i) the inner relationship is built among the context points in the partial observation; (ii) the unseen targets are inferred by learning the cross-relationship with the observed data points. We further find that the proposed method can be generalized to different scenarios regardless of whether the physical structure can be observed or not. We demonstrate that the proposed OR-Net generalizes well to data completion tasks of various modalities, including function regression, image completion on the MNIST and CelebA datasets, and sequential motion generation conditioned on the observed poses.

Faster Meta Update Strategy for Noise-Robust Deep Learning

Apr 30, 2021

Youjiang Xu, Linchao Zhu, Lu Jiang, Yi Yang

Abstract: It has been shown that deep neural networks are prone to overfitting on biased training data. To address this issue, meta-learning employs a meta model for correcting the training bias. Despite the promising performance, very slow training is currently the bottleneck of meta-learning approaches. In this paper, we introduce a novel Faster Meta Update Strategy (FaMUS) to replace the most expensive step in the meta gradient computation with a faster layer-wise approximation. We empirically find that FaMUS yields not only a reasonably accurate but also a low-variance approximation of the meta gradient. We conduct extensive experiments to verify the proposed method on two tasks. We show that our method is able to save two-thirds of the training time while maintaining comparable or even better generalization performance. In particular, our method achieves state-of-the-art performance on both synthetic and realistic noisy labels, and obtains promising performance on long-tailed recognition on standard benchmarks.

* Accepted to CVPR 2021
