Saining Xie

CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

Nov 07, 2025

Cambrian-S: Towards Spatial Supersensing in Video

Nov 06, 2025

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Nov 06, 2025

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Nov 06, 2025

MetaCLIP 2: A Worldwide Scaling Recipe

Jul 29, 2025

Spatial Mental Modeling from Limited Views

Jun 26, 2025

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Jun 13, 2025

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

May 21, 2025

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

May 15, 2025

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

May 14, 2025