Chao Zhang

refer to the report for detailed contributions

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

Oct 09, 2024
Via arXiv

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Oct 07, 2024
Via arXiv

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Oct 04, 2024
Via arXiv

SWIM: Short-Window CNN Integrated with Mamba for EEG-Based Auditory Spatial Attention Decoding

Sep 30, 2024
Via arXiv

LW2G: Learning Whether to Grow for Prompt-based Continual Learning

Sep 27, 2024
Via arXiv

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Sep 25, 2024
Via arXiv

MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events

Sep 25, 2024
Via arXiv

Beyond Relevance: Improving User Engagement by Personalization for Short-Video Search

Sep 17, 2024
Via arXiv

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Sep 17, 2024
Via arXiv

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Sep 15, 2024
Via arXiv