Fengyun Rao

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Jul 30, 2025
Via arXiv

WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

Jun 09, 2025
Via arXiv

Instruction-augmented Multimodal Alignment for Image-Text and Element Matching

Apr 16, 2025
Via arXiv

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Mar 26, 2025
Via arXiv

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Mar 26, 2025
Via arXiv

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Mar 13, 2025
Via arXiv

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Mar 09, 2025
Via arXiv

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Mar 04, 2025
Via arXiv

Number it: Temporal Grounding Videos like Flipping Manga

Nov 15, 2024
Via arXiv

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Oct 15, 2024
Via arXiv