Zengrui Jin

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Nov 14, 2025

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Oct 29, 2025

Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision

May 27, 2025

Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models

May 27, 2025

Effective and Efficient Mixed Precision Quantization of Speech Foundation Models

Jan 07, 2025

Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition

Dec 25, 2024

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Nov 26, 2024

CR-CTC: Consistency Regularization on CTC for Improved Speech Recognition

Oct 07, 2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Sep 01, 2024

Advancing Multi-talker ASR Performance with Large Language Models

Aug 30, 2024