Jun Song

AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Dec 24, 2025

Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS

Nov 13, 2025

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Aug 28, 2025

InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Aug 27, 2025

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Aug 25, 2025

DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Aug 07, 2025

Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding

May 30, 2025

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Apr 29, 2025

GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

Apr 17, 2025

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

Mar 12, 2025