speech


SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation

Sep 26, 2025
Via arXiv

Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization

Sep 19, 2025
Via arXiv

Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning

Sep 19, 2025
Via arXiv

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Sep 19, 2025
Via arXiv

Compose Yourself: Average-Velocity Flow Matching for One-Step Speech Enhancement

Sep 19, 2025
Via arXiv

DISPATCH: Distilling Selective Patches for Speech Enhancement

Sep 19, 2025
Via arXiv

Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

Sep 19, 2025
Via arXiv

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition

Sep 19, 2025
Via arXiv

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS

Sep 19, 2025
Via arXiv

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders

Sep 19, 2025
Via arXiv