Zejun Ma

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Jul 10, 2024
Via arXiv

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Jun 22, 2024
Via arXiv

Can Large Language Models Understand Spatial Audio?

Jun 12, 2024
Via arXiv

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

Mar 04, 2024
Via arXiv

SLIT: Boosting Audio-Text Pre-Training via Multi-Stage Learning and Instruction Tuning

Feb 20, 2024
Via arXiv

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Jan 20, 2024
Via arXiv

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

Nov 15, 2023
Via arXiv

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Oct 20, 2023
Via arXiv

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Oct 10, 2023
Via arXiv

Connecting Speech Encoder and Large Language Model for ASR

Sep 26, 2023
Via arXiv