Hanyi Zhang

Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Feb 22, 2023
Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

Visual speech (i.e., lip motion) is highly correlated with auditory speech due to their co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality by exploiting knowledge from the other. Specifically, two cross-modal boosters are introduced on top of an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only systems and baseline fusion systems, respectively.
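
As a rough illustration of the kind of building block the abstract describes, the PyTorch sketch below embeds a max-feature-map (MFM) activation inside a Transformer-style booster layer. The class names, dimensions, and exact block layout are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """MFM activation: split the feature dimension in half and take the
    element-wise maximum, which halves the width (as in LightCNN)."""
    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        return torch.max(a, b)

class MFMBoosterBlock(nn.Module):
    """Hypothetical booster layer: attention over the other modality
    followed by an MFM feed-forward, with residual connections."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Expand to 2*dim so the MFM brings the width back to dim.
        self.ff = nn.Sequential(nn.Linear(dim, 2 * dim), MaxFeatureMap())

    def forward(self, x, context=None):
        # Attend to the other modality if given, otherwise self-attend.
        q = self.norm1(x)
        kv = q if context is None else self.norm1(context)
        attn_out, _ = self.attn(q, kv, kv)
        x = x + attn_out
        return x + self.ff(self.norm2(x))
```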

Diversifying Agent's Behaviors in Interactive Decision Models

Mar 06, 2022
Yinghui Pan, Hanyi Zhang, Yifeng Zeng, Biyang Ma, Jing Tang, Zhong Ming

Modelling other agents' behaviors plays an important role in decision models for interactions among multiple agents. To optimise its own decisions, a subject agent needs to model how other agents act simultaneously in an uncertain environment. However, modelling insufficiency occurs when the agents are competitive and the subject agent cannot obtain full knowledge of the other agents. Even when the agents are collaborative, they may not share their true behaviors due to privacy concerns. In this article, we investigate diversifying the behaviors of other agents in the subject agent's decision model prior to their interactions. Starting with prior knowledge about other agents' behaviors, we use a linear reduction technique to extract representative behavioral features from the known behaviors. We subsequently generate new behaviors by expanding the features and propose two diversity measurements to select the top-K behaviors. We demonstrate the performance of the new techniques in two well-studied problem domains. This research will contribute to intelligent systems dealing with unknown unknowns in an open artificial intelligence world.
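
The NumPy sketch below illustrates the general pipeline of reducing known behaviors to linear features, expanding them into new candidates, and picking a diverse top-K. The function name is hypothetical and a simple greedy max-min distance heuristic stands in for the paper's two diversity measurements; it is not the authors' algorithm.

```python
import numpy as np

def diversify_behaviors(known, n_candidates=200, k=5, rng=None):
    """Sketch: expand known behavior vectors along principal components
    and greedily pick the K candidates that are most mutually distant."""
    rng = np.random.default_rng(rng)
    mean = known.mean(axis=0)
    # Linear reduction: principal components of the known behaviors.
    _, _, vt = np.linalg.svd(known - mean, full_matrices=False)
    comps = vt[: min(3, vt.shape[0])]           # keep a few leading features

    # Generate candidates by perturbing coefficients along the components.
    coeffs = rng.normal(scale=1.0, size=(n_candidates, comps.shape[0]))
    candidates = mean + coeffs @ comps

    # Greedy max-min selection as one simple diversity measurement.
    chosen = [int(rng.integers(len(candidates)))]
    while len(chosen) < k:
        dists = np.min(
            np.linalg.norm(candidates[:, None] - candidates[chosen], axis=-1),
            axis=1,
        )
        dists[chosen] = -np.inf                  # never re-pick a selected one
        chosen.append(int(dists.argmax()))
    return candidates[chosen]

# Toy usage: each row is one known behavior vector (e.g., a flattened policy).
known = np.random.rand(10, 8)
new_behaviors = diversify_behaviors(known, k=5)
```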

* 19 pages, 15 figures 

Exploring Deep Learning for Joint Audio-Visual Lip Biometrics

Apr 17, 2021
Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang

Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a moderate-size database from existing public databases. We then establish the DeepLip AV lip biometrics system, realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module. Our experiments show that DeepLip outperforms traditional speaker recognition models in context modeling and achieves over 50% relative improvement compared with our best single-modality baseline, with equal error rates of 0.75% and 1.11% on the two test datasets, respectively.
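
As a minimal sketch of two of the pieces mentioned above, the PyTorch snippet below shows a single TDNN layer (a dilated 1-D convolution over frames) and an embedding-level fusion module that concatenates audio and visual speaker embeddings. The CNN video module is omitted, and all dimensions and layer choices are assumptions for illustration rather than the DeepLip architecture.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """A time-delay layer is a dilated 1-D convolution over the frame axis."""
    def __init__(self, in_dim, out_dim, context=5, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, context, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):            # x: (batch, feat_dim, n_frames)
        return self.act(self.conv(x))

class AVFusion(nn.Module):
    """Concatenate utterance-level audio and visual embeddings, then
    project to a joint speaker embedding for verification scoring."""
    def __init__(self, audio_dim=512, video_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + video_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, audio_emb, video_emb):
        return self.proj(torch.cat([audio_emb, video_emb], dim=-1))
```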
