Hassan Akbari

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dec 21, 2023

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

May 10, 2023

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Nov 03, 2022

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Sep 16, 2022

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Apr 22, 2021

Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

Nov 18, 2020

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

Nov 28, 2018

Lip2AudSpec: Speech reconstruction from silent lip movements video

Oct 26, 2017