Dan Guo

Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding

Sep 12, 2023
Jiaxiu Li, Kun Li, Jia Li, Guoliang Chen, Dan Guo, Meng Wang

Make-up temporal video grounding (MTVG) aims to localize the video segment that is semantically related to a sentence describing a make-up activity, given a long video. Compared with general video grounding, MTVG focuses on meticulous actions and changes on the face. A make-up instruction step, which usually involves subtle differences in products and facial areas, is more fine-grained than general activities (e.g., cooking or furniture assembly), so existing general approaches cannot locate the target activity effectively. More specifically, existing proposal generation modules are not yet capable of providing semantic cues for such fine-grained make-up semantic comprehension. To tackle this issue, we propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic details of make-up activities. DPTMO extracts both query-agnostic and query-guided features to construct two proposal sets and evaluates each set with its own method. Different from the single-path structure commonly used in previous methods, our dual-path structure can mine richer semantic information in make-up videos and distinguish fine-grained actions well. The two candidate sets represent, respectively, cross-modal video-text similarity and the multi-modal fusion relationship, and complement each other. Each set corresponds to its own optimization perspective, and their joint prediction enhances the accuracy of video timestamp prediction. Comprehensive experiments on the YouMakeup dataset demonstrate that our proposed dual structure excels in fine-grained semantic comprehension.
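
The dual-path design can be pictured as two scorers over the same 2D temporal proposal map: one matches proposals to the query by cross-modal similarity, the other rates query-fused features. Below is a minimal sketch of that reading, assuming 2D-TAN-style proposal maps; module names and shapes are illustrative, not the paper's code.

```python
# Minimal dual-path 2D temporal-map scorer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathMapScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Path 1 (query-agnostic): score proposals by cosine similarity
        # between proposal features and a pooled sentence embedding.
        self.video_proj = nn.Conv2d(dim, dim, 1)
        self.query_proj = nn.Linear(dim, dim)
        # Path 2 (query-guided): score proposals from fused features.
        self.fusion = nn.Conv2d(dim, dim, 1)
        self.fused_head = nn.Conv2d(dim, 1, 1)

    def forward(self, prop_map, query_feat):
        # prop_map: (B, C, T, T) 2D map of candidate-moment features
        # query_feat: (B, C) pooled sentence embedding
        v = F.normalize(self.video_proj(prop_map), dim=1)
        q = F.normalize(self.query_proj(query_feat), dim=1)
        sim_map = (v * q[:, :, None, None]).sum(1)           # (B, T, T) similarity scores
        fused = self.fusion(prop_map) * q[:, :, None, None]  # query-guided fusion
        fused_map = self.fused_head(fused).squeeze(1)        # (B, T, T) fusion scores
        # Joint prediction: the two complementary paths are combined.
        return torch.sigmoid(sim_map) * torch.sigmoid(fused_map)

scores = DualPathMapScorer()(torch.randn(2, 256, 16, 16), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 16, 16])
```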

Exploiting Diverse Feature for Multimodal Sentiment Analysis

Aug 25, 2023
Jia Li, Wei Qian, Kun Li, Qi Li, Dan Guo, Meng Wang

In this paper, we present our solution to the MuSe-Personalisation sub-challenge of the MuSe 2023 Multimodal Sentiment Analysis Challenge. The MuSe-Personalisation task aims to predict continuous arousal and valence values for a participant from their audio-visual, language, and physiological signal modalities. Since different people have distinct personal characteristics, the main challenge of this task is how to build robust feature representations for sentiment prediction. To address this issue, we propose exploiting diverse features. Specifically, we adopt a series of feature extraction methods to build robust representations and a model ensemble. We empirically evaluate the performance of the method on the officially provided dataset. As a result, we achieved 3rd place in the MuSe-Personalisation sub-challenge, with CCC scores of 0.8492 for arousal and 0.8439 for valence.
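
For reference, the reported CCC metric has a standard closed form, and a model ensemble can be as simple as averaging per-model predictions. The snippet below computes CCC and an averaged ensemble; the actual ensemble weighting used by the team is not specified in the abstract.

```python
# Concordance correlation coefficient (CCC) and a simple prediction ensemble.
import numpy as np

def ccc(pred, gold):
    """CCC between two 1-D series: 2*cov / (var_p + var_g + (mean_p - mean_g)^2)."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

# Ensemble by averaging per-model predictions (one hedged possibility).
preds = [np.random.rand(100) for _ in range(3)]  # predictions from 3 models
gold = np.random.rand(100)
ensemble = np.mean(preds, axis=0)
print(ccc(ensemble, gold))
```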

Dual-path TokenLearner for Remote Photoplethysmography-based Physiological Measurement with Facial Videos

Aug 15, 2023
Wei Qian, Dan Guo, Kun Li, Xilan Tian, Meng Wang

Remote photoplethysmography (rPPG) based physiological measurement is an emerging yet crucial vision task, whose challenge lies in accurately predicting rPPG from facial videos in a non-contact manner, despite noise from illumination variations, facial occlusions, head movements, etc. Existing mainstream CNN-based models detect physiological signals by capturing the subtle color changes that heartbeats cause in facial regions of interest (ROIs). However, such models are constrained by the limited local spatial or temporal receptive fields of their neural units. In contrast, this paper proposes a native Transformer-based framework called Dual-path TokenLearner (Dual-TL), which uses learnable tokens to integrate spatial and temporal contexts from the global perspective of the video. Specifically, Dual-TL uses a Spatial TokenLearner (S-TL) to explore associations among different facial ROIs, keeping rPPG prediction robust to noisy ROI disturbances. Complementarily, a Temporal TokenLearner (T-TL) is designed to infer the quasi-periodic pattern of heartbeats, which suppresses temporal disturbances such as head movements. The two TokenLearners, S-TL and T-TL, are executed in a dual-path mode, enabling the model to reduce noise disturbances in the final rPPG signal prediction. Extensive experiments on four physiological measurement benchmark datasets show that Dual-TL achieves state-of-the-art performance in both intra- and cross-dataset testing, demonstrating its potential as a basic backbone for rPPG measurement. The source code is available at https://github.com/VUT-HFUT/Dual-TL
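
A TokenLearner distills many positions into a few learned tokens via attention weights. The sketch below shows one plausible dual-path arrangement over facial-ROI features; shapes and module names are assumptions, not the released Dual-TL code.

```python
# TokenLearner-style pooling and a dual (spatial/temporal) arrangement (sketch).
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Pool N input positions into S learned tokens via attention maps."""
    def __init__(self, dim, num_tokens=8):
        super().__init__()
        self.attn = nn.Linear(dim, num_tokens)  # one attention map per token

    def forward(self, x):                 # x: (B, N, C)
        w = self.attn(x).softmax(dim=1)   # (B, N, S) weights over positions
        return torch.einsum('bns,bnc->bsc', w, x)  # (B, S, C) learned tokens

B, T, R, C = 2, 30, 16, 128                # clips, frames, facial ROIs, channels
feats = torch.randn(B, T, R, C)
s_tl = TokenLearner(C)                     # spatial path: tokens over ROIs per frame
t_tl = TokenLearner(C)                     # temporal path: tokens over time per ROI
spatial_tokens = s_tl(feats.flatten(0, 1)).view(B, T, -1, C)   # (B, T, S, C)
temporal_tokens = t_tl(feats.transpose(1, 2).flatten(0, 1))    # (B*R, S, C)
print(spatial_tokens.shape, temporal_tokens.view(B, R, -1, C).shape)
```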

M&M: Tackling False Positives in Mammography with a Multi-view and Multi-instance Learning Sparse Detector

Aug 11, 2023
Yen Nhi Truong Vu, Dan Guo, Ahmed Taha, Jason Su, Thomas Paul Matthews

Deep-learning-based object detection methods show promise for improving screening mammography, but high rates of false positives can hinder their effectiveness in clinical practice. To reduce false positives, we identify three challenges: (1) unlike natural images, a malignant mammogram typically contains only one malignant finding; (2) mammography exams contain two views of each breast, and both views ought to be considered to make a correct assessment; (3) most mammograms are negative and do not contain any findings. In this work, we tackle the three aforementioned challenges by: (1) leveraging Sparse R-CNN and showing that sparse detectors are more appropriate than dense detectors for mammography; (2) including a multi-view cross-attention module to synthesize information from different views; (3) incorporating multi-instance learning (MIL) to train with unannotated images and perform breast-level classification. The resulting model, M&M, is a Multi-view and Multi-instance learning system that can both localize malignant findings and provide breast-level predictions. We validate M&M's detection and classification performance using five mammography datasets. In addition, we demonstrate the effectiveness of each proposed component through comprehensive ablation studies.

* MICCAI 2023 with supplementary materials 
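
The MIL component can be read as max-pooling instance (proposal) scores across both views into a single breast-level score, trainable from breast-level labels alone. Below is a minimal sketch of that reading; it is illustrative, not the released M&M code.

```python
# Multi-instance learning (MIL) pooling for breast-level classification (sketch).
import torch
import torch.nn.functional as F

def breast_level_loss(cc_scores, mlo_scores, breast_label):
    """cc_scores / mlo_scores: (K,) malignancy logits for K proposals per view;
    breast_label: scalar 0/1 ground truth for the whole breast."""
    all_scores = torch.cat([cc_scores, mlo_scores])  # pool proposals from both views
    breast_logit = all_scores.max()                  # MIL: bag score = max instance score
    return F.binary_cross_entropy_with_logits(breast_logit, breast_label.float())

# Works even for unannotated images, since only the breast-level label is needed.
loss = breast_level_loss(torch.randn(300), torch.randn(300), torch.tensor(1))
print(loss.item())
```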

ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

Aug 11, 2023
Kun Li, Dan Guo, Meng Wang

The video grounding (VG) task aims to locate a queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods are trapped in the complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. In particular, we present a simple but effective proposal-free framework, the Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are twofold: (1) the token is unrelated to the video or the query, which avoids data bias toward the original video and query; (2) the token simultaneously performs global context aggregation over video and query features. First, we employ a shared feature encoder to project both video and query into a joint feature space, then apply cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Furthermore, we concatenate a learnable regression token [REG] with the video and query features as the input to a vision-language transformer. Finally, we use the [REG] token to predict the target moment and use visual features to constrain the foreground and background probabilities at each timestamp. ViGT performs well on three public datasets: ANet Captions, TACoS, and YouCookII. Extensive ablation studies and qualitative analysis further validate the interpretability of ViGT.

* This paper has been accepted by SCIENCE CHINA Information Sciences 
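
The [REG]-token mechanism described in the abstract above is straightforward to sketch: prepend a learnable token to the concatenated video and query features, encode with a transformer, and regress the moment from the token's output slot. Layer sizes below are illustrative, not the paper's configuration.

```python
# Boundary regression from a learnable [REG] token (sketch).
import torch
import torch.nn as nn

class RegTokenGrounder(nn.Module):
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [REG]
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2), nn.Sigmoid())  # normalized boundary

    def forward(self, video_feat, query_feat):
        # video_feat: (B, Tv, C); query_feat: (B, Tq, C), already co-attended
        reg = self.reg_token.expand(video_feat.size(0), -1, -1)
        x = torch.cat([reg, video_feat, query_feat], dim=1)  # prepend [REG]
        out = self.encoder(x)
        return self.head(out[:, 0])  # predict the moment from the [REG] slot only

pred = RegTokenGrounder()(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
print(pred.shape)  # torch.Size([2, 2])
```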

Data Augmentation for Human Behavior Analysis in Multi-Person Conversations

Aug 03, 2023
Kun Li, Dan Guo, Guoliang Chen, Feiyang Liu, Meng Wang

In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select the Swin Transformer as our baseline and exploit data augmentation strategies to address the three tasks. Specifically, we crop the raw video to remove noise from irrelevant regions, and we apply data augmentation to improve the generalization of the model. As a result, our solution achieves the best results on the corresponding test sets: 0.6262 mean average precision for bodily behavior recognition and 0.7771 accuracy for eye contact detection. In addition, our approach achieves a comparable unweighted average recall of 0.5281 for next speaker prediction.

* Solutions of HFUT-VUT Team at the ACM MM 2023 Grand Challenge (MultiMediate: Multi-modal Behaviour Analysis for Artificial Mediation). Accepted at ACM MM 2023 
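
The crop-then-augment recipe described above might look like the following, using common torchvision transforms; the crop region and the specific augmentation choices are assumptions, not the team's exact pipeline.

```python
# Crop the person region, then augment frames (illustrative sketch).
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

frames = torch.rand(16, 3, 256, 256)     # a clip of 16 RGB frames
person_box = (32, 32, 192, 192)          # hypothetical person crop (top, left, h, w)
t, l, h, w = person_box
cropped = frames[:, :, t:t + h, l:l + w] # remove noise from other regions
# Note: for real clips the random parameters should be shared across frames;
# per-frame sampling here keeps the sketch short.
augmented = torch.stack([augment(f) for f in cropped])
print(augmented.shape)  # torch.Size([16, 3, 224, 224])
```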

Joint Skeletal and Semantic Embedding Loss for Micro-gesture Classification

Jul 20, 2023
Kun Li, Dan Guo, Guoliang Chen, Xinge Peng, Meng Wang

In this paper, we briefly introduce the solution of our team HFUT-VUT for the Micro-gesture Classification track of the MiGA challenge at IJCAI 2023. The micro-gesture classification task aims to recognize the action category of a given video based on skeleton data. For this task, we propose a 3D-CNN-based micro-gesture recognition network that incorporates a joint skeletal and semantic embedding loss to improve classification performance. Finally, we rank 1st in the Micro-gesture Classification Challenge, surpassing the second-place team in Top-1 accuracy by 1.10%.

* 1st Place in Micro-gesture Classification sub-challenge in MiGA at IJCAI-2023 
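
One plausible reading of the joint skeletal and semantic embedding loss is a classification term plus a term pulling skeleton features toward per-class semantic anchors. The sketch below illustrates that reading; the anchors and the weighting are assumptions, not the paper's exact formulation.

```python
# Classification loss + semantic embedding alignment (illustrative sketch).
import torch
import torch.nn.functional as F

def joint_loss(feat, logits, labels, class_embed, alpha=0.5):
    """feat: (B, D) skeleton features from the 3D-CNN; logits: (B, K);
    class_embed: (K, D) semantic embeddings of the K gesture labels."""
    cls_loss = F.cross_entropy(logits, labels)
    # Pull each sample's feature toward its class's semantic embedding.
    target = F.normalize(class_embed[labels], dim=-1)
    emb_loss = 1 - F.cosine_similarity(F.normalize(feat, dim=-1), target).mean()
    return cls_loss + alpha * emb_loss

loss = joint_loss(torch.randn(8, 512), torch.randn(8, 32),
                  torch.randint(0, 32, (8,)), torch.randn(32, 512))
print(loss.item())
```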

Improving Audio-Visual Video Parsing with Pseudo Visual Labels

Mar 04, 2023
Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang

Audio-visual video parsing is the task of predicting, for each modality, the events that occur in the segments of a video. It is often performed in a weakly supervised manner, where only video-level event labels are provided, i.e., the modalities and timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known event labels for each modality. However, such labels are still limited to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by exploiting prior knowledge learned from the open world. Specifically, we use the CLIP model to estimate the events in each video segment from the visual modality, generating segment-level pseudo labels. A new loss function regularizes these labels by taking into account their category-richness and segment-richness. A label denoising strategy further improves the pseudo labels by flipping them whenever a high forward binary cross-entropy loss occurs. Extensive experiments on the LLP dataset demonstrate that our method generates high-quality segment-level pseudo labels with the help of the proposed loss and the label denoising strategy, achieving state-of-the-art audio-visual video parsing performance.
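
The label-denoising step is concrete enough to sketch: compute the forward BCE between predictions and pseudo labels, and flip the entries with the highest loss. The top-ratio threshold rule below is an assumption; the paper's selection criterion may differ.

```python
# Flip pseudo labels whose forward BCE loss is unusually high (sketch).
import torch
import torch.nn.functional as F

def denoise_pseudo_labels(probs, pseudo, ratio=0.1):
    """probs: (N, K) model predictions in (0, 1); pseudo: (N, K) binary pseudo labels."""
    bce = F.binary_cross_entropy(probs, pseudo, reduction='none')  # per-entry loss
    k = max(1, int(ratio * bce.numel()))
    thresh = bce.flatten().topk(k).values.min()  # threshold at the top-ratio losses
    flip = bce >= thresh                         # entries the model strongly disputes
    return torch.where(flip, 1.0 - pseudo, pseudo)

probs = torch.rand(10, 25)
pseudo = (torch.rand(10, 25) > 0.5).float()
print(denoise_pseudo_labels(probs, pseudo).shape)  # torch.Size([10, 25])
```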

Audio-Visual Segmentation with Semantics

Jan 30, 2023
Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source and Multi-sources subsets) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: (1) semi-supervised audio-visual segmentation with a single sound source; (2) fully-supervised audio-visual segmentation with multiple sound sources; and (3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects that indicate the pixels corresponding to the audio, while the third further requires generating semantic maps that indicate the object category. To address these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods on related tasks, demonstrating that the proposed method is a promising bridge between audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench. An online benchmark is available at http://www.avlbench.opennlplab.cn.

* Submitted to TPAMI as a journal extension of ECCV 2022. Jinxing Zhou, Xuyang Shen, and Jianyuan Wang contribute equally to this work. Meng Wang and Yiran Zhong are the corresponding authors. Code is available at https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at http://www.avlbench.opennlplab.cn. arXiv admin note: substantial text overlap with arXiv:2207.05042 
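
The pixel-wise audio-visual interaction module can be sketched as audio-conditioned attention over the per-frame visual map, reweighting pixels by how well they match the audio. The module below is an illustrative reading of the abstract, not the released AVSBench baseline.

```python
# Audio-conditioned pixel reweighting for segmentation guidance (sketch).
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)      # audio -> query
        self.to_k = nn.Conv2d(dim, dim, 1)   # pixels -> keys
        self.to_v = nn.Conv2d(dim, dim, 1)   # pixels -> values

    def forward(self, vis, aud):
        # vis: (B, C, H, W) per-frame visual map; aud: (B, C) per-frame audio feature
        B, C, H, W = vis.shape
        q = self.to_q(aud).unsqueeze(1)             # (B, 1, C) audio query
        k = self.to_k(vis).flatten(2)               # (B, C, HW)
        v = self.to_v(vis).flatten(2)               # (B, C, HW)
        attn = (q @ k / C ** 0.5).softmax(dim=-1)   # (B, 1, HW) pixel weights
        gated = (v * attn).view(B, C, H, W)         # audio-reweighted pixels
        return vis + gated                          # residual injection of audio semantics

out = AudioVisualInteraction()(torch.randn(2, 256, 28, 28), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 256, 28, 28])
```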

Contrastive Positive Sample Propagation along the Audio-Visual Event Line

Nov 18, 2022
Jinxing Zhou, Dan Guo, Meng Wang

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments containing an AVE and identify its category, for which it is pivotal to learn discriminative features for each video segment. Unlike existing work focusing on audio-visual feature fusion, in this paper we propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The contribution of CPSP is to introduce the available full or weak label as a prior for constructing exact positive-negative samples for contrastive learning. Specifically, CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP) and segment-level and video-level positive sample activation (PSA$_S$ and PSA$_V$). Three new contrastive objectives are proposed (i.e., $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_\text{spsa}$, and $\mathcal{L}_\text{vpsa}$) and introduced into both fully and weakly supervised AVE localization. To draw a complete picture of contrastive learning in AVE localization, we also study self-supervised positive sample propagation (SSPSP). As a result, CPSP yields refined audio-visual features that are distinguishable from the negatives, benefiting classifier prediction. Extensive experiments on the AVE dataset and the newly collected VGGSound-AVEL100k dataset verify the effectiveness and generalization ability of our method.

* Accepted to TPAMI; Dataset and Code are available at https://github.com/jasongief/CPSP. arXiv admin note: substantial text overlap with arXiv:2104.00239 
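
The label-as-prior idea can be illustrated with a supervised InfoNCE-style loss in which segments sharing an event label are treated as positives. This simplified objective stands in for the paper's $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_\text{spsa}$, and $\mathcal{L}_\text{vpsa}$ terms; it is a sketch, not the authors' formulation.

```python
# Label-guided contrastive loss: same-label segments are positives (sketch).
import torch
import torch.nn.functional as F

def label_guided_contrastive(feats, labels, tau=0.1):
    """feats: (N, D) segment features; labels: (N,) event categories."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.t() / tau                                # (N, N) scaled similarities
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, float('-inf'))            # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log-softmax over candidates
    pos = (labels[:, None] == labels[None, :]) & ~eye    # same-label pairs are positives
    denom = pos.sum(1).clamp(min=1)                      # positives per anchor
    return -(log_prob.masked_fill(~pos, 0.0).sum(1) / denom).mean()

loss = label_guided_contrastive(torch.randn(16, 128), torch.randint(0, 4, (16,)))
print(loss.item())
```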