Visual Understanding


Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Jun 12, 2025
Via arXiv

MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

Jun 12, 2025
Via arXiv

Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Jun 12, 2025
Via arXiv

DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding

Jun 11, 2025
Via arXiv

Can Sound Replace Vision in LLaVA With Token Substitution?

Jun 12, 2025
Via arXiv

IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Jun 11, 2025
Via arXiv

CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

Jun 11, 2025
Via arXiv

VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Jun 12, 2025
Via arXiv

PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis

Jun 12, 2025
Via arXiv

SlotPi: Physics-informed Object-centric Reasoning Models

Jun 12, 2025
Via arXiv