Khoa Vo

Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective

Nov 18, 2025
Via arXiv

SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

Nov 10, 2025
Via arXiv

Amodal Instance Segmentation with Diffusion Shape Prior Estimation

Sep 26, 2024
Via arXiv

Error Detection and Constraint Recovery in Hierarchical Multi-Label Classification without Prior Knowledge

Jul 21, 2024
Via arXiv

HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model

Jun 01, 2024
Via arXiv

ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

Mar 22, 2024
Via arXiv

ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Nov 04, 2023
Via arXiv

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

Oct 05, 2023
Via arXiv

Contextual Explainable Video Representation: Human Perception-based Understanding

Dec 17, 2022
Via arXiv

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Dec 09, 2022
Via arXiv