Ali Vosoughi

Can Sound Replace Vision in LLaVA With Token Substitution?

Jun 12, 2025
Via arXiv

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

May 26, 2025
Via arXiv

$I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion

May 22, 2025
Via arXiv

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Apr 09, 2025
Via arXiv

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Mar 14, 2025
Via arXiv

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Mar 12, 2025
Via arXiv

Enhancing Graph Attention Neural Network Performance for Marijuana Consumption Classification through Large-scale Augmented Granger Causality (lsAGC) Analysis of Functional MR Images

Oct 24, 2024
Via arXiv

EAGLE: Egocentric AGgregated Language-video Engine

Sep 26, 2024
Via arXiv

OSCaR: Object State Captioning and State Change Representation

Feb 28, 2024
Via arXiv

Learning Audio Concepts from Counterfactual Natural Language

Jan 10, 2024
Via arXiv