Xin Gu

GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Mar 11, 2026

SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

Mar 02, 2026

Towards Long-Form Spatio-Temporal Video Grounding

Feb 26, 2026

Cocoon: A System Architecture for Differentially Private Training with Correlated Noises

Oct 08, 2025

Generalized Scattering Matrix Framework for Modeling Implantable Antennas in Multilayered Spherical Media

Jul 17, 2025

SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

May 05, 2025

Vidi: Large Multimodal Models for Video Understanding and Editing

Apr 22, 2025

OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Mar 13, 2025

Multi-Reward as Condition for Instruction-based Image Editing

Nov 06, 2024

Edit3K: Universal Representation Learning for Video Editing Components

Mar 24, 2024