Title: SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering

Apr 01, 2025

Bingxin Li

Abstract: Multimodal models integrating speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA), where spoken questions about images require direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving the gap between speech and visual modalities underexplored because of their inherent heterogeneity. To address this, we introduce SViQA, a unified speech-vision model that directly processes spoken questions without text transcription. Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction that eliminates intermediate text conversion, and (2) cross-modal alignment optimization that enables effective fusion of speech signals with visual content. Extensive experiments on the SBVQA benchmark show that SViQA achieves state-of-the-art performance, with 75.62% accuracy, and competitive multimodal generalization. Using mixed speech-text input raises accuracy to 78.85%, an improvement of 3.23 percentage points over pure speech input, highlighting SViQA's enhanced robustness and effective cross-modal attention alignment.
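The reported gain can be read two ways, so a quick check of the arithmetic may help. The sketch below uses only the two accuracy figures quoted in the abstract; the function name `improvement` and the variable names are illustrative assumptions, not anything taken from the paper.

```python
def improvement(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline          # difference between the two accuracies
    relative = absolute / baseline * 100.0  # gain expressed as a share of the baseline
    return absolute, relative


# Accuracy figures quoted in the abstract (percent).
speech_only = 75.62   # pure speech input
speech_text = 78.85   # mixed speech-text input

abs_gain, rel_gain = improvement(speech_only, speech_text)
print(f"absolute gain: {abs_gain:.2f} percentage points")
print(f"relative gain: {rel_gain:.2f}%")
```

The 3.23 figure in the abstract matches the absolute difference between the two accuracies; the relative gain over the speech-only baseline comes out at roughly 4.27%.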