Xihan Wei

CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

May 06, 2025

ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

Apr 25, 2025

ViSpeak: Visual Instruction Feedback in Streaming Videos

Mar 17, 2025

A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

Mar 13, 2025

R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning

Mar 07, 2025

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Jan 31, 2025

HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

Jan 25, 2025

Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

Jan 16, 2025

Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

Jan 14, 2025

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

Jan 09, 2025