Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Aristeidis Papadopoulos

VisG AV-HuBERT: Viseme-Guided AV-HuBERT

Apr 01, 2026

Aristeidis Papadopoulos, Rishabh Jain, Naomi Harte

Abstract:Audio-Visual Speech Recognition (AVSR) systems nowadays integrate Large Language Model (LLM) decoders with transformer-based encoders, achieving state-of-the-art results. However, the relative contributions of improved language modelling versus enhanced audiovisual encoding remain unclear. We propose Viseme-Guided AV-HuBERT (VisG AV-HuBERT), a multi-task fine-tuning framework that incorporates auxiliary viseme classification to strengthen the model's reliance on visual articulatory features. By extending AV-HuBERT with a lightweight viseme prediction sub-network, this method explicitly guides the encoder to preserve visual speech information. Evaluated on LRS3, VisG AV-HuBERT achieves comparable or improved performance over the baseline AV-HuBERT, with notable gains under heavy noise conditions. WER reduces from 13.59% to 6.60% (51.4% relative improvement) at -10 dB Signal-to-Noise Ratio (SNR) for Speech noise. Deeper analysis reveals substantial reductions in substitution errors across noise types, demonstrating improved speech unit discrimination. Evaluation on LRS2 confirms generalization capability. Our results demonstrate that explicit viseme modelling enhances encoder representations, and provides a foundation for enhancing noise-robust AVSR through encoder-level improvements.

* Includes Supplementary Material. Accepted for Publication at International Conference on Pattern Recognition 2026 - ICPR 2026. Code is available at https://github.com/aristosp/visg_avhubert

Via

Access Paper or Ask Questions

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Sep 19, 2025

Aristeidis Papadopoulos, Naomi Harte

Figure 1 for Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Figure 2 for Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Figure 3 for Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Figure 4 for Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Abstract:Audio-Visual Speech Recognition (AVSR) models have surpassed their audio-only counterparts in terms of performance. However, the interpretability of AVSR systems, particularly the role of the visual modality, remains under-explored. In this paper, we apply several interpretability techniques to examine how visemes are encoded in AV-HuBERT a state-of-the-art AVSR model. First, we use t-distributed Stochastic Neighbour Embedding (t-SNE) to visualize learned features, revealing natural clustering driven by visual cues, which is further refined by the presence of audio. Then, we employ probing to show how audio contributes to refining feature representations, particularly for visemes that are visually ambiguous or under-represented. Our findings shed light on the interplay between modalities in AVSR and could point to new strategies for leveraging visual information to improve AVSR performance.

* Accepted into Automatic Speech Recognition and Understanding- ASRU 2025

Via

Access Paper or Ask Questions