Ruibo Fu

LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Nov 24, 2024

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Sep 18, 2024

Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

Sep 18, 2024

Towards Diverse and Efficient Audio Captioning via Diffusion Models

Sep 14, 2024

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

Sep 14, 2024

EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech

Aug 20, 2024

Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

Aug 20, 2024

A Noval Feature via Color Quantisation for Fake Audio Detection

Aug 20, 2024

Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

Aug 13, 2024

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Aug 11, 2024