Ge Zhu

A Generative-First Neural Audio Autoencoder

Feb 17, 2026

Stemphonic: All-at-once Flexible Multi-stem Music Generation

Feb 10, 2026

A Review on Score-based Generative Models for Audio Applications

Jun 10, 2025

ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

Feb 13, 2025

Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Nov 28, 2024

Presto! Distilling Steps and Layers for Accelerating Music Generation

Oct 07, 2024

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

Aug 13, 2024

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Mar 20, 2024

Cacophony: An Improved Contrastive Audio-Text Model

Feb 10, 2024

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Nov 18, 2023