Zhan Tong

Contextual AD Narration with Interleaved Multimodal Sequence

Mar 19, 2024

TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Dec 26, 2023

Bootstrapping SparseFormers from Vision Foundation Models

Dec 04, 2023

Advancing Vision Transformers with Group-Mix Attention

Nov 26, 2023

Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training

Sep 25, 2023

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

May 23, 2023

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Apr 18, 2023

Efficient Video Action Detection with Token Dropout and Context Refinement

Apr 17, 2023

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

Apr 07, 2023

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

Mar 30, 2023