Xuenan Xu

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Mar 26, 2026
Via arXiv

STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation

Mar 19, 2026
Via arXiv

CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

Mar 17, 2026
Via arXiv

HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

Jan 13, 2026
Via arXiv

MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio

Mar 07, 2025
Via arXiv

Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

Dec 24, 2024
Via arXiv

Unified Pathological Speech Analysis with Prompt Tuning

Nov 05, 2024
Via arXiv

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Oct 12, 2024
Via arXiv

DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning

Oct 12, 2024
Via arXiv

Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Jul 19, 2024
Via arXiv