Speech recognition


Speech recognition is the task of identifying words spoken aloud by analyzing the speaker's voice and language, then transcribing those words accurately as text.

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Mar 09, 2026

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

Mar 09, 2026

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Mar 10, 2026

Reading the Mood Behind Words: Integrating Prosody-Derived Emotional Context into Socially Responsive VR Agents

Mar 10, 2026

NLE: Non-autoregressive LLM-based ASR by Transcript Editing

Mar 09, 2026

Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS

Mar 09, 2026

Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Mar 12, 2026

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Mar 09, 2026

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

Mar 11, 2026

Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

Mar 08, 2026