Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiasen Lu

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Dec 28, 2023
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

* 38 pages, 20 figures

Via

Access Paper or Ask Questions

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jun 17, 2022
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

Figure 1 for Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Figure 2 for Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Figure 3 for Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Figure 4 for Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression comprehension, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task or benchmark specific fine-tuning. Demos for Unified-IO are available at https://unified-io.allenai.org.

Via

Access Paper or Ask Questions

ASC me to Do Anything: Multi-task Training for Embodied AI

Feb 14, 2022
Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi

Figure 1 for ASC me to Do Anything: Multi-task Training for Embodied AI

Figure 2 for ASC me to Do Anything: Multi-task Training for Embodied AI

Figure 3 for ASC me to Do Anything: Multi-task Training for Embodied AI

Figure 4 for ASC me to Do Anything: Multi-task Training for Embodied AI

Embodied AI has seen steady progress across a diverse set of independent tasks. While these varied tasks have different end goals, the basic skills required to complete them successfully overlap significantly. In this paper, our goal is to leverage these shared skills to learn to perform multiple tasks jointly. We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks. The key to the success of this approach is a pre-training scheme that decouples learning of the skills from the high-level tasks making joint training effective. We use ASC to train agents within the AI2-THOR environment to perform four interactive tasks jointly and find it to be remarkably effective. In a multi-task setting, ASC improves success rates by a factor of 2x on Seen scenes and 4x on Unseen scenes compared to no pre-training. Importantly, ASC enables us to train a multi-task agent that has a 52% higher Success Rate than training 4 independent single task agents. Finally, our hierarchical agents are more interpretable than traditional black-box architectures.

* 22 pages, 11 figures

Via

Access Paper or Ask Questions

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Jan 07, 2022
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

Figure 1 for MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Figure 2 for MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Figure 3 for MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Figure 4 for MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.

* project page at https://rowanzellers.com/merlotreserve

Via

Access Paper or Ask Questions

A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Nov 29, 2021
Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

Figure 1 for A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Figure 2 for A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Figure 3 for A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Figure 4 for A Simple Long-Tailed Recognition Baseline via Vision-Language Model

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either perform class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, limiting their supervision information and restricting their transferability to novel instances. Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervisions, pretrained contrastive vision-language models learn powerful multimodal representations that are promising to handle data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and further employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets the new state-of-the-art performances and outperforms competitive baselines with a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.

Via

Access Paper or Ask Questions

Container: Context Aggregation Network

Jun 02, 2021
Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

Figure 1 for Container: Context Aggregation Network

Figure 2 for Container: Context Aggregation Network

Figure 3 for Container: Context Aggregation Network

Figure 4 for Container: Context Aggregation Network

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers -- originally introduced in natural language processing -- have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the \model (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions \emph{a la} Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named \modellight, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.

Via

Access Paper or Ask Questions

Multi-Modal Answer Validation for Knowledge-Based VQA

Mar 23, 2021
Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

Figure 1 for Multi-Modal Answer Validation for Knowledge-Based VQA

Figure 2 for Multi-Modal Answer Validation for Knowledge-Based VQA

Figure 3 for Multi-Modal Answer Validation for Knowledge-Based VQA

Figure 4 for Multi-Modal Answer Validation for Knowledge-Based VQA

The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in a variety of forms, including visual, textual, and commonsense knowledge. The use of more knowledge sources, however, also increases the chance of retrieving more irrelevant or noisy facts, making it difficult to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using External knowledge (MAVEx), where the idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval. This is in contrast to existing approaches that search for the answer in a vast collection of often irrelevant facts. Our approach aims to learn which knowledge source should be trusted for each answer candidate and how to validate the candidate using that source. We consider a multi-modal setting, relying on both textual and visual knowledge resources, including images searched using Google, sentences from Wikipedia articles, and concepts from ConceptNet. Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results.

Via

Access Paper or Ask Questions

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Sep 23, 2020
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

Figure 1 for X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Figure 2 for X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Figure 3 for X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Figure 4 for X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios and aligning the right pre-training datasets to the right objectives which enables it to paint. X-LXMERT's image generation capabilities rival state of the art generative models while its question answering and captioning abilities remains comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.

* EMNLP 2020

Via

Access Paper or Ask Questions

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Jul 24, 2020
Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra

Figure 1 for Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Figure 2 for Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Figure 3 for Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Figure 4 for Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at https://github.com/mcogswell/dialog_without_dialog

* 19 pages, 8 figures

Via

Access Paper or Ask Questions

Spatially Aware Multimodal Transformers for TextVQA

Jul 23, 2020
Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal

Figure 1 for Spatially Aware Multimodal Transformers for TextVQA

Figure 2 for Spatially Aware Multimodal Transformers for TextVQA

Figure 3 for Spatially Aware Multimodal Transformers for TextVQA

Figure 4 for Spatially Aware Multimodal Transformers for TextVQA

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.

* Accepted at European Conference on Computer Vision 2020

Via

Access Paper or Ask Questions