Shusheng Yang

Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Mar 08, 2025

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Dec 18, 2024

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Jun 24, 2024

Qwen Technical Report

Sep 28, 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Sep 14, 2023

TouchStone: Evaluating Vision-Language Models by Language Models

Sep 04, 2023

ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers

May 24, 2023

MobileInst: Video Instance Segmentation on the Mobile

Mar 30, 2023

Masked Visual Reconstruction in Language Semantic Space

Jan 17, 2023

Masked Image Modeling with Denoising Contrast

May 19, 2022