Yong Man Ro

DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

May 29, 2025
Via arXiv

Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

May 29, 2025
Via arXiv

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

Mar 14, 2025
Via arXiv

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Mar 08, 2025
Via arXiv

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Dec 30, 2024
Via arXiv

Long-Form Speech Generation with Spoken Language Models

Dec 24, 2024
Via arXiv

Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor

Dec 23, 2024
Via arXiv

AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues

Dec 23, 2024
Via arXiv

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Dec 02, 2024
Via arXiv

Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

Nov 29, 2024
Via arXiv