Shilong Liu

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

May 29, 2025

Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution

May 26, 2025

On Path to Multimodal Historical Reasoning: HistBench and HistAgent

May 26, 2025

A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Feb 22, 2025

TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Nov 27, 2024

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Nov 21, 2024

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Oct 29, 2024

TAPTRv2: Attention-based Position Update Improves Tracking Any Point

Jul 23, 2024

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

Jul 02, 2024

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Jul 01, 2024