Sicheng Yang

Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

Jan 15, 2026

VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Jan 15, 2026

Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation

Nov 12, 2025

K-Stain: Keypoint-Driven Correspondence for H&E-to-IHC Virtual Staining

Nov 10, 2025

VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

Nov 10, 2025

AutoMR: A Universal Time Series Motion Recognition Pipeline

Feb 21, 2025

Duo Streamers: A Streaming Gesture Recognition Framework

Feb 17, 2025

VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

Jan 06, 2025

Cross-conditioned Diffusion Model for Medical Image to Image Translation

Sep 13, 2024

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Apr 02, 2024