Chong Deng

FunAudio-ASR Technical Report

Sep 15, 2025
Via arXiv

Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

Sep 04, 2025
Via arXiv

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Aug 08, 2025
Via arXiv

OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

Jun 11, 2025
Via arXiv

Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization

May 20, 2025
Via arXiv

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Jan 10, 2025
Via arXiv

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Dec 13, 2024
Via arXiv

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Oct 23, 2024
Via arXiv

Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

Aug 19, 2024
Via arXiv

Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Aug 01, 2024
Via arXiv