Gedas Bertasius

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

May 29, 2024

Siamese Vision Transformers are Scalable Audio-visual Learners

Mar 28, 2024

Augmented Reality Demonstrations for Scalable Robot Imitation Learning

Mar 20, 2024

DAM: Dynamic Adapter Merging for Continual Video QA Learning

Mar 13, 2024

Video ReCap: Recursive Captioning of Hour-Long Videos

Feb 28, 2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Jan 25, 2024

A Simple LLM Framework for Long-Range Video Question-Answering

Dec 28, 2023

RGNet: A Unified Retrieval and Grounding Network for Long Videos

Dec 11, 2023

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Nov 30, 2023

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Sep 18, 2023