Yong Man Ro

Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

Jan 05, 2026
Via arXiv

GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory

Nov 15, 2025
Via arXiv

Unified Reinforcement and Imitation Learning for Vision-Language Models

Oct 22, 2025
Via arXiv

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Jun 18, 2025
Via arXiv

Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

May 29, 2025
Via arXiv

DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

May 29, 2025
Via arXiv

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

Mar 14, 2025
Via arXiv

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Mar 08, 2025
Via arXiv

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Dec 30, 2024
Via arXiv

Long-Form Speech Generation with Spoken Language Models

Dec 24, 2024
Via arXiv