Kun Li

Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding

Sep 12, 2023
Jiaxiu Li, Kun Li, Jia Li, Guoliang Chen, Dan Guo, Meng Wang

Make-up temporal video grounding (MTVG) aims to localize, in a long video, the target segment that is semantically related to a sentence describing a make-up activity. Compared with general video grounding, MTVG focuses on meticulous actions and changes on the face. A make-up instruction step, which usually involves subtle differences in products and facial areas, is more fine-grained than general activities (e.g., cooking or furniture assembly), so existing general approaches cannot locate the target activity effectively. More specifically, existing proposal generation modules do not yet provide sufficient semantic cues for fine-grained make-up comprehension. To tackle this issue, we propose an effective proposal-based framework, the Dual-Path Temporal Map Optimization Network (DPTMO), to capture fine-grained multimodal semantic details of make-up activities. DPTMO extracts both query-agnostic and query-guided features to construct two proposal sets and evaluates each set with a dedicated method. Unlike the single-path structure commonly used in previous methods, our dual-path structure mines richer semantic information in make-up videos and distinguishes fine-grained actions well. The two candidate sets capture cross-modal video-text similarity and multi-modal fusion relationships, respectively, and complement each other; their joint prediction improves the accuracy of the predicted timestamps. Comprehensive experiments on the YouMakeup dataset demonstrate that the proposed dual structure excels at fine-grained semantic comprehension.
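To make the dual-path design concrete, here is a minimal PyTorch sketch (not the authors' code; the map construction, module names, and the way the two scores are combined are illustrative assumptions): a 2D (start, end) proposal map is built once from query-agnostic clip features and scored by cross-modal similarity, and once from query-fused features and scored by a small convolutional head, with the two maps multiplied for the joint prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathTemporalMap(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)            # query-guided fusion
        self.fused_head = nn.Conv2d(dim, 1, kernel_size=1)

    @staticmethod
    def build_map(clip_feats):
        """Average-pool clip features into a (start, end) proposal map."""
        N, T, D = clip_feats.shape
        tmap = clip_feats.new_zeros(N, T, T, D)
        for s in range(T):
            for e in range(s, T):                      # entries with e < s stay zero
                tmap[:, s, e] = clip_feats[:, s:e + 1].mean(dim=1)
        return tmap                                    # (N, T, T, D)

    def forward(self, clip_feats, query_feat):
        # Path 1: query-agnostic map scored by cosine similarity to the query.
        agnostic_map = self.build_map(clip_feats)
        sim_map = F.cosine_similarity(
            agnostic_map, query_feat[:, None, None, :], dim=-1)

        # Path 2: query-guided map built from fused video-query features.
        fused = self.fuse(torch.cat(
            [clip_feats, query_feat[:, None, :].expand_as(clip_feats)], dim=-1))
        guided_map = self.build_map(fused).permute(0, 3, 1, 2)   # (N, D, T, T)
        fused_score = self.fused_head(guided_map).squeeze(1)

        # Joint prediction: the two complementary maps are combined.
        return sim_map.sigmoid() * fused_score.sigmoid()

scores = DualPathTemporalMap()(torch.randn(2, 16, 256), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 16, 16])
```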

FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection

Sep 11, 2023
Chunyong Hu, Hang Zheng, Kun Li, Jianyun Xu, Weibo Mao, Maochun Luo, Lingxuan Wang, Mingxia Chen, Kaixuan Liu, Yiru Zhao, Peihan Hao, Minzhe Liu, Kaicheng Yu

Multi-sensor fusion has demonstrated strong advantages in 3D object detection. However, existing methods that fuse multi-modal features through simple channel concatenation must first transform the features into bird's-eye-view (BEV) space and may lose information along the Z-axis, which leads to inferior performance. To this end, we propose FusionFormer, an end-to-end multi-modal fusion framework that leverages transformers to fuse multi-modal features and obtain fused BEV features. Exploiting FusionFormer's flexibility with respect to the input modality representation, we further propose a depth prediction branch that can be added to the framework to improve detection performance in camera-based detection. In addition, we propose a plug-and-play transformer-based temporal fusion module that fuses BEV features from historical frames for more stable and reliable detection. On the nuScenes dataset, our method achieves 72.6% mAP and 75.1% NDS for 3D object detection, outperforming state-of-the-art methods.
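As a rough illustration of fusing modalities with a transformer rather than channel concatenation, the sketch below (PyTorch; layer sizes, token counts, and the residual structure are placeholder assumptions, not the released architecture) lets a shared set of learnable BEV queries cross-attend to camera and LiDAR tokens and, optionally, to the previous frame's BEV features for temporal fusion.

```python
import torch
import torch.nn as nn

class BEVFusionDecoder(nn.Module):
    def __init__(self, dim=256, bev_size=50, heads=8):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.cam_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens, lidar_tokens, prev_bev=None):
        B = cam_tokens.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        # Fuse the two modalities by cross-attending from shared BEV queries.
        q = q + self.cam_attn(q, cam_tokens, cam_tokens)[0]
        q = q + self.lidar_attn(q, lidar_tokens, lidar_tokens)[0]
        # Plug-and-play temporal fusion with the previous frame's BEV features.
        if prev_bev is not None:
            q = q + self.temporal_attn(q, prev_bev, prev_bev)[0]
        return self.norm(q)   # (B, bev_size*bev_size, dim) fused BEV features

bev = BEVFusionDecoder()(torch.randn(2, 1000, 256), torch.randn(2, 1200, 256))
print(bev.shape)  # torch.Size([2, 2500, 256])
```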

Exploiting Diverse Feature for Multimodal Sentiment Analysis

Aug 25, 2023
Jia Li, Wei Qian, Kun Li, Qi Li, Dan Guo, Meng Wang

In this paper, we present our solution to the MuSe-Personalisation sub-challenge of the MuSe 2023 Multimodal Sentiment Analysis Challenge. The task is to predict continuous arousal and valence values of a participant from audio-visual, language, and physiological signal modalities. Because different people have distinct personal characteristics, the main challenge is how to build robust feature representations for sentiment prediction. To address this issue, we exploit diverse features: we propose a series of feature extraction methods to build a robust representation and a model ensemble, and we empirically evaluate the approach on the officially provided dataset. As a result, we achieved 3rd place in the MuSe-Personalisation sub-challenge, with CCC scores of 0.8492 for arousal and 0.8439 for valence.
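For reference, the challenge metric reported above is the Concordance Correlation Coefficient (CCC); a minimal NumPy implementation of its standard definition is sketched below (this is the textbook formula, not code from the authors' solution).

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

rng = np.random.default_rng(0)
arousal_true = rng.normal(size=200)
arousal_pred = arousal_true + 0.1 * rng.normal(size=200)
print(round(ccc(arousal_true, arousal_pred), 4))   # close to 1.0 for a good fit
```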

Dual-path TokenLearner for Remote Photoplethysmography-based Physiological Measurement with Facial Videos

Aug 15, 2023
Wei Qian, Dan Guo, Kun Li, Xilan Tian, Meng Wang

Remote photoplethysmography (rPPG) based physiological measurement is an emerging yet crucial vision task, whose challenge lies in accurately predicting rPPG, in a non-contact manner, from facial videos contaminated by illumination variations, facial occlusions, head movements, etc. Mainstream CNN-based models detect physiological signals by capturing the subtle color changes that heartbeats cause in facial regions of interest (ROIs), but they are constrained by the limited local spatial or temporal receptive fields of their neural units. In contrast, this paper proposes a native Transformer-based framework, Dual-path TokenLearner (Dual-TL), which uses learnable tokens to integrate spatial and temporal context from the global perspective of the video. Specifically, Dual-TL uses a Spatial TokenLearner (S-TL) to explore associations among different facial ROIs, keeping the rPPG prediction away from noisy ROI disturbances. Complementarily, a Temporal TokenLearner (T-TL) is designed to infer the quasi-periodic pattern of heartbeats, which suppresses temporal disturbances such as head movements. The two TokenLearners, S-TL and T-TL, are executed in a dual-path mode, enabling the model to reduce noise disturbances for the final rPPG signal prediction. Extensive experiments on four physiological measurement benchmark datasets show that Dual-TL achieves state-of-the-art performance in both intra- and cross-dataset testing, demonstrating its potential as a basic backbone for rPPG measurement. The source code is available at https://github.com/VUT-HFUT/Dual-TL.
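A minimal sketch of the dual-path TokenLearner idea is given below, assuming a standard TokenLearner-style formulation in PyTorch: a small set of learnable tokens summarizes the ROI tokens of each frame (the spatial path), another set summarizes the per-frame features over time (the temporal path), and the two paths are concatenated for the per-frame rPPG output. Shapes, token counts, and the prediction head are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class TokenLearnerBlock(nn.Module):
    def __init__(self, dim=128, num_tokens=4, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, dim)
        q = self.tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.attn(q, x, x)[0]            # (B, num_tokens, dim)

class DualPathRPPG(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.s_tl = TokenLearnerBlock(dim)      # aggregates facial ROI tokens
        self.t_tl = TokenLearnerBlock(dim)      # aggregates frame-level tokens
        self.head = nn.Linear(2 * dim, 1)       # per-frame rPPG value

    def forward(self, roi_feats):               # (B, T, R, dim): T frames, R ROIs
        B, T, R, D = roi_feats.shape
        s = self.s_tl(roi_feats.reshape(B * T, R, D)).mean(1).reshape(B, T, D)
        t = self.t_tl(s)                        # temporal tokens over the video
        t = t.mean(1, keepdim=True).expand(-1, T, -1)
        return self.head(torch.cat([s, t], dim=-1)).squeeze(-1)   # (B, T)

signal = DualPathRPPG()(torch.randn(2, 160, 9, 128))
print(signal.shape)  # torch.Size([2, 160])
```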

FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving

Aug 14, 2023
Tengju Ye, Wei Jing, Chunyong Hu, Shikun Huang, Lingping Gao, Fangzhen Li, Jingke Wang, Ke Guo, Wencong Xiao, Weibo Mao, Hang Zheng, Kun Li, Junbo Chen, Kaicheng Yu

Building a multi-modality, multi-task neural network for accurate and robust performance is a de facto standard for the perception tasks of autonomous driving. However, leveraging data from multiple sensors to jointly optimize the prediction and planning tasks remains largely unexplored. In this paper, we present FusionAD, to the best of our knowledge the first unified framework that fuses information from the two most critical sensors, camera and LiDAR, and goes beyond perception. Concretely, we first build a transformer-based multi-modality fusion network to effectively produce fusion-based features. In contrast to the camera-based end-to-end method UniAD, we then establish fusion-aided modality-aware prediction and status-aware planning modules, dubbed FMSPnP, that take advantage of the multi-modality features. On the widely used nuScenes benchmark, FusionAD achieves state-of-the-art performance, surpassing baselines by an average of 15% on perception tasks such as detection and tracking and by 10% on occupancy prediction accuracy, reducing the prediction ADE from 0.708 to 0.389 and the collision rate from 0.31% to only 0.12%.
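As a highly simplified illustration of the flow described above, the PyTorch sketch below feeds fused BEV features into a motion-prediction head for agent queries and a status-aware planning head for the ego vehicle; all module names, shapes, and the ego-status vector are hypothetical placeholders rather than the FMSPnP implementation.

```python
import torch
import torch.nn as nn

class FMSPnPSketch(nn.Module):
    def __init__(self, dim=256, horizon=6):
        super().__init__()
        self.pred_head = nn.Linear(dim, horizon * 2)        # agent future (x, y)
        self.plan_head = nn.Sequential(                     # ego trajectory
            nn.Linear(dim + 4, dim), nn.ReLU(), nn.Linear(dim, horizon * 2))

    def forward(self, fused_bev, agent_feats, ego_status):
        # fused_bev: (B, N, dim) fused camera+LiDAR BEV tokens
        # agent_feats: (B, A, dim) per-agent queries; ego_status: (B, 4)
        pred = self.pred_head(agent_feats)                  # (B, A, horizon*2)
        scene = fused_bev.mean(dim=1)                       # global scene context
        plan = self.plan_head(torch.cat([scene, ego_status], dim=-1))
        return pred, plan

pred, plan = FMSPnPSketch()(torch.randn(2, 2500, 256),
                            torch.randn(2, 10, 256), torch.randn(2, 4))
print(pred.shape, plan.shape)  # torch.Size([2, 10, 12]) torch.Size([2, 12])
```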

ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

Aug 11, 2023
Kun Li, Dan Guo, Meng Wang

The video grounding (VG) task aims to locate a queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods are trapped in the complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation. In this paper, we propose a novel boundary regression paradigm that performs regression-token learning in a transformer. In particular, we present a simple but effective proposal-free framework, the Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. The benefits of a learnable token are twofold: (1) the token is unrelated to the video or the query and thus avoids data bias toward the original video and query; (2) the token simultaneously performs global context aggregation over video and query features. Concretely, we first employ a shared feature encoder to project both video and query into a joint feature space, followed by cross-modal co-attention (i.e., video-to-query and query-to-video attention) to highlight discriminative features in each modality. We then concatenate a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we use the [REG] token to predict the target moment and use the visual features to constrain the foreground and background probabilities at each timestamp. ViGT performs well on three public datasets: ANet Captions, TACoS, and YouCookII. Extensive ablation studies and qualitative analysis further validate the interpretability of ViGT.

* This paper has been accepted by SCIENCE CHINA Information Sciences 
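A minimal sketch of the [REG]-token mechanism, under the stated assumptions that the video and query features are already projected into a joint space and that co-attention is omitted for brevity (PyTorch; layer sizes are placeholders): the learnable token is prepended to the concatenated video and query tokens, passed through a transformer encoder, and its output alone regresses the normalized (start, end) boundary.

```python
import torch
import torch.nn as nn

class ViGTSketch(nn.Module):
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.randn(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.boundary_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, video_feats, query_feats):
        # video_feats: (B, Tv, dim), query_feats: (B, Tq, dim), joint space assumed
        reg = self.reg_token.expand(video_feats.size(0), -1, -1)
        tokens = torch.cat([reg, video_feats, query_feats], dim=1)
        out = self.encoder(tokens)
        return self.boundary_head(out[:, 0])    # normalized (start, end)

bounds = ViGTSketch()(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
print(bounds.shape)  # torch.Size([2, 2])
```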

Data Augmentation for Human Behavior Analysis in Multi-Person Conversations

Aug 03, 2023
Kun Li, Dan Guo, Guoliang Chen, Feiyang Liu, Meng Wang

In this paper, we present the solution of our team, HFUT-VUT, for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select the Swin Transformer as the baseline and exploit data augmentation strategies to address the three tasks. Specifically, we crop the raw video to remove noise from irrelevant regions and apply data augmentation to improve the generalization of the model. As a result, our solution achieves the best results on the corresponding test sets: 0.6262 mean average precision for bodily behavior recognition and 0.7771 accuracy for eye contact detection. Our approach also achieves a competitive unweighted average recall of 0.5281 for next speaker prediction.

* Solutions of HFUT-VUT Team at the ACM MM 2023 Grand Challenge (MultiMediate: Multi-modal Behaviour Analysis for Artificial Mediation). Accepted at ACM MM 2023 
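As an illustration of the preprocessing described above, a hedged sketch of such a pipeline with torchvision is given below: frames are cropped to the region around the target person to remove background noise, then standard augmentations are applied before the Swin Transformer backbone. The crop box and the specific transforms are illustrative assumptions, not the team's exact settings.

```python
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def make_train_transform(crop_box=(60, 40, 224, 224)):
    """crop_box = (top, left, height, width) around the target person (assumed)."""
    top, left, height, width = crop_box
    return T.Compose([
        # Remove irrelevant regions so other participants do not add noise.
        T.Lambda(lambda img: TF.crop(img, top, left, height, width)),
        T.Resize((224, 224)),
        # Standard augmentations to improve generalization.
        T.RandomHorizontalFlip(p=0.5),
        T.ColorJitter(brightness=0.2, contrast=0.2),
        T.ToTensor(),
    ])

# Usage (with a decoded PIL frame):
#   frame = PIL.Image.open("frame_0001.jpg")
#   clip_tensor = make_train_transform()(frame)   # shape (3, 224, 224)
```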