Xitong Yang

Progress-Aware Video Frame Captioning

Dec 03, 2024

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Sep 30, 2024

GenRec: Unifying Video Generation and Recognition with Diffusion Models

Aug 27, 2024

Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Aug 07, 2024

Video ReCap: Recursive Captioning of Hour-Long Videos

Feb 28, 2024

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

Oct 08, 2023

Towards Scalable Neural Representation for Diverse Videos

Mar 24, 2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Feb 16, 2023

Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Feb 01, 2023