Zhisheng Zheng

SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

Jan 14, 2026

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Nov 15, 2025

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

May 19, 2025

Scaling Rich Style-Prompted Text-to-Speech Datasets

Mar 06, 2025

DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning

Oct 12, 2024

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Oct 12, 2024

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Jun 11, 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Feb 02, 2024

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Jan 07, 2024

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Dec 23, 2023