Shuhei Kurita

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Apr 09, 2026

Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Apr 02, 2026

JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Apr 01, 2026

EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Mar 31, 2026

HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching

Mar 28, 2026

PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Mar 17, 2026

From Dialogue to Execution: Mixture-of-Agents Assisted Interactive Planning for Behavior Tree-Based Long-Horizon Robot Execution

Mar 01, 2026

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Feb 18, 2026

STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models

Oct 26, 2025

Developing Vision-Language-Action Model from Egocentric Videos

Sep 26, 2025