Xitong Yang

Video ReCap: Recursive Captioning of Hour-Long Videos

Feb 28, 2024
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

Oct 08, 2023
Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang

Towards Scalable Neural Representation for Diverse Videos

Mar 24, 2023
Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Feb 16, 2023
Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Feb 01, 2023
Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

Vision Transformers Are Good Mask Auto-Labelers

Jan 10, 2023
Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization

Mar 29, 2022
Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav Shrivastava

Efficient Video Transformers with Spatial-Temporal Token Selection

Nov 23, 2021
Junke Wang, Xitong Yang, Hengduo Li, Zuxuan Wu, Yu-Gang Jiang
