Shenghui Zhao

MSCT: Differential Cross-Modal Attention for Deepfake Detection

Apr 09, 2026

Fangda Wei, Miao Liu, Yingxue Wang, Jing Wang, Shenghui Zhao, Nan Li

Abstract: Audio-visual deepfake detection typically employs a complementary multi-modal model to detect forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.

* Accepted by ICASSP 2026
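
The abstract does not say how the differential cross-modal attention is computed. The following is a minimal sketch of one plausible reading, assuming it subtracts a second softmax attention map from the first (in the style of differential attention) with queries taken from one modality and keys and values from the other. The function names, the `lam` weight, and the split into two query/key sets are all hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def differential_cross_attention(q1, k1, q2, k2, values, lam=0.5):
    """Assumed form: (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) @ values.

    q1, q2: two projections of modality A (e.g. audio), hypothetical.
    k1, k2, values: projections of modality B (e.g. video), hypothetical.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for row1, row2 in zip(a1, a2):
        # Subtracting the second map is meant to cancel attention common to both.
        weights = [x - lam * y for x, y in zip(row1, row2)]
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out
```

With `lam=0` this reduces to ordinary cross-attention, so the subtraction term is the only part specific to the assumed differential variant.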

Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing

Apr 07, 2026

Zhiyu Li, Xinwen Yue, Shenghui Zhao, Jing Wang

Abstract: We propose a multimodal deep learning model for VR auralization that generates spatial room impulse responses (SRIRs) in real time to reconstruct scene-specific auditory perception. Employing SRIRs as the output reduces computational complexity and facilitates integration with personalized head-related transfer functions. The model takes two modalities as input: scene information and waveforms, where the waveform corresponds to the low-order reflections (LoR). LoR can be efficiently computed using geometrical acoustics (GA) but remains difficult for deep learning models to predict accurately. Scene geometry, acoustic properties, source coordinates, and listener coordinates are first used to compute LoR in real time via GA, and both LoR and these features are subsequently provided as inputs to the model. A new dataset was constructed, consisting of multiple scenes and their corresponding SRIRs. The dataset exhibits greater diversity. Experimental results demonstrate the superior performance of the proposed model.

* This work was accepted by ICASSP 2026

A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction

Sep 26, 2021

Yubo An, Shenghui Zhao

Abstract: In this paper, a video summarization method using temporal interest detection and key frame prediction is proposed for supervised video summarization, where video summarization is formulated as a combination of sequence labeling and temporal interest detection problems. In our method, we first build a flexible universal network framework that simultaneously predicts frame-level importance scores and temporal interest segments, and then combine the two components with different weights to achieve a more detailed video summarization. Extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our method. Specifically, compared with other state-of-the-art methods, its performance improves by at least 2.6% and 4.2% on TVSum and SumMe, respectively.
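
The weighted combination of the two predictions can be illustrated with a small sketch. The abstract does not give the actual fusion rule or weights, so everything here is an assumption: the function names, the `(start, end, score)` segment format, the default weights of 0.5, and the top-`budget` selection step are hypothetical.

```python
def combine_scores(frame_scores, segments, w_frame=0.5, w_segment=0.5):
    """Blend per-frame importance with segment-level interest scores.

    frame_scores: list of floats, one per frame.
    segments: list of (start, end, score) tuples, with `end` exclusive.
    """
    segment_scores = [0.0] * len(frame_scores)
    for start, end, score in segments:
        for i in range(max(start, 0), min(end, len(frame_scores))):
            # A frame covered by several segments keeps the highest score.
            segment_scores[i] = max(segment_scores[i], score)
    return [
        w_frame * f + w_segment * s
        for f, s in zip(frame_scores, segment_scores)
    ]


def select_key_frames(scores, budget):
    """Return indices of the `budget` highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

For example, `combine_scores([0.1, 0.9, 0.2, 0.4], [(2, 4, 0.8)])` raises the scores of frames 2 and 3 because they fall inside a detected interest segment, so `select_key_frames` with a budget of 2 then prefers them over frame 1.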