Picture for Po-Yao Huang

Po-Yao Huang

Demystifying CLIP Data

Add code
Oct 02, 2023
Figure 1 for Demystifying CLIP Data
Figure 2 for Demystifying CLIP Data
Figure 3 for Demystifying CLIP Data
Figure 4 for Demystifying CLIP Data
Viaarxiv icon

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Add code
Sep 19, 2023
Figure 1 for AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Figure 2 for AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Figure 3 for AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Viaarxiv icon

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Add code
Jun 01, 2023
Figure 1 for Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Figure 2 for Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Figure 3 for Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Figure 4 for Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Viaarxiv icon

DINOv2: Learning Robust Visual Features without Supervision

Add code
Apr 14, 2023
Figure 1 for DINOv2: Learning Robust Visual Features without Supervision
Figure 2 for DINOv2: Learning Robust Visual Features without Supervision
Figure 3 for DINOv2: Learning Robust Visual Features without Supervision
Figure 4 for DINOv2: Learning Robust Visual Features without Supervision
Viaarxiv icon

Diffusion Models as Masked Autoencoders

Add code
Apr 06, 2023
Figure 1 for Diffusion Models as Masked Autoencoders
Figure 2 for Diffusion Models as Masked Autoencoders
Figure 3 for Diffusion Models as Masked Autoencoders
Figure 4 for Diffusion Models as Masked Autoencoders
Viaarxiv icon

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Add code
Mar 31, 2023
Figure 1 for STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Figure 2 for STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Figure 3 for STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Figure 4 for STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Viaarxiv icon

CiT: Curation in Training for Effective Vision-Language Data

Add code
Jan 05, 2023
Figure 1 for CiT: Curation in Training for Effective Vision-Language Data
Figure 2 for CiT: Curation in Training for Effective Vision-Language Data
Figure 3 for CiT: Curation in Training for Effective Vision-Language Data
Figure 4 for CiT: Curation in Training for Effective Vision-Language Data
Viaarxiv icon

MAViL: Masked Audio-Video Learners

Add code
Dec 15, 2022
Figure 1 for MAViL: Masked Audio-Video Learners
Figure 2 for MAViL: Masked Audio-Video Learners
Figure 3 for MAViL: Masked Audio-Video Learners
Figure 4 for MAViL: Masked Audio-Video Learners
Viaarxiv icon

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

Add code
Apr 03, 2022
Figure 1 for AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Figure 2 for AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Figure 3 for AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Figure 4 for AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Viaarxiv icon

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Add code
Oct 01, 2021
Figure 1 for VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Figure 2 for VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Figure 3 for VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Figure 4 for VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Viaarxiv icon