
Shijie Li

Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

Mar 05, 2026

DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows

Feb 28, 2026

GLM-5: from Vibe Coding to Agentic Engineering

Feb 17, 2026

One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation

Feb 17, 2026

DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

Jan 26, 2026

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Jan 25, 2026

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Oct 02, 2025

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aug 08, 2025

CogStream: Context-guided Streaming Video Question Answering

Jun 12, 2025

Zero-Shot 3D Visual Grounding from Vision-Language Models

May 28, 2025