Shusheng Yang

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Nov 06, 2025

Cambrian-S: Towards Spatial Supersensing in Video
Nov 06, 2025

VideoNSA: Native Sparse Attention Scales Video Understanding
Oct 02, 2025

Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity
Mar 08, 2025

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Dec 18, 2024

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Jun 24, 2024

Qwen Technical Report
Sep 28, 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Sep 14, 2023

TouchStone: Evaluating Vision-Language Models by Language Models
Sep 04, 2023

ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers
May 24, 2023