Yuan Gong

xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

Jun 16, 2025

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

May 02, 2025

Can Diffusion Models Disentangle? A Theoretical Perspective

Mar 31, 2025

State-Space Large Audio Language Models

Nov 24, 2024

A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation

Oct 29, 2024

AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models

Sep 26, 2024

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Sep 17, 2024

DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners

Jul 04, 2024

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

Jun 26, 2024

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Jun 14, 2024