
Le Xue

Enabling Ultra-Fast Cardiovascular Imaging Across Heterogeneous Clinical Environments with a Generalist Foundation Model and Multimodal Database

Dec 25, 2025

Robotic VLA Benefits from Joint Learning with Motion Image Diffusion

Dec 19, 2025

PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography

Aug 06, 2025

BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset

May 14, 2025

SemiSAM+: Rethinking Semi-Supervised Medical Image Segmentation in the Era of Foundation Models

Feb 28, 2025

SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

Feb 20, 2025

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Dec 09, 2024

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Nov 12, 2024

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Oct 21, 2024

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Aug 22, 2024