Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhe Gan

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Nov 25, 2021

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng(+2 more)

Figure 1 for An Empirical Study of Training End-to-End Vision-and-Language Transformers

Figure 2 for An Empirical Study of Training End-to-End Vision-and-Language Transformers

Figure 3 for An Empirical Study of Training End-to-End Vision-and-Language Transformers

Figure 4 for An Empirical Study of Training End-to-End Vision-and-Language Transformers

Abstract:Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIPViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, our best model achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Code and models are released at https://github.com/zdou0830/METER.

Via

Access Paper or Ask Questions

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Nov 24, 2021

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Figure 1 for VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Figure 2 for VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Figure 3 for VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Figure 4 for VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Abstract:A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.

* Code is available at https://github.com/tsujuifu/pytorch_violet

Via

Access Paper or Ask Questions

Scaling Up Vision-Language Pre-training for Image Captioning

Nov 24, 2021

Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

Figure 1 for Scaling Up Vision-Language Pre-training for Image Captioning

Figure 2 for Scaling Up Vision-Language Pre-training for Image Captioning

Figure 3 for Scaling Up Vision-Language Pre-training for Image Captioning

Figure 4 for Scaling Up Vision-Language Pre-training for Image Captioning

Abstract:In recent years, we have witnessed significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training transformers with moderate sizes (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs which are automatically collected from web based on the alt attribute of the image (dubbed as ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state of the arts on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.

Via

Access Paper or Ask Questions

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Nov 23, 2021

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

Figure 1 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Figure 2 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Figure 3 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Figure 4 for Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

Abstract:In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence, which can be integrated with text tokens. We formulate all VL problems as a generation task, where the target sequence consists of the integrated text and box tokens. We then train a transformer encoder-decoder to predict the target in an auto-regressive manner. With such a unified framework and input-output format, UNICORN achieves comparable performance to task-specific state of the art on 7 VL benchmarks, covering the visual grounding, grounded captioning, visual question answering, and image captioning tasks. When trained with multi-task finetuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing downstream task boundary. We show that having a single model not only saves parameters, but also further boosts the model performance on certain tasks. Finally, UNICORN shows the capability of generalizing to new tasks such as ImageNet object localization.

Via

Access Paper or Ask Questions

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Nov 19, 2021

Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang

Figure 1 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Figure 2 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Figure 3 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Figure 4 for UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

Abstract:In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, image-text matching loss, and masked language modeling loss based on the bidirectional and the seq2seq attention mask. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and achieve new state of the arts on visual question answering, COCO image captioning (cross-entropy optimization) and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.

Via

Access Paper or Ask Questions

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Nov 04, 2021

Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, Bo Li

Figure 1 for Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Figure 2 for Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Figure 3 for Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Figure 4 for Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Abstract:Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

* Oral Presentation in NeurIPS 2021 (Datasets and Benchmarks Track). 24 pages, 4 figures, 12 tables

Via

Access Paper or Ask Questions

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Sep 10, 2021

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang

Figure 1 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Figure 2 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Figure 3 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Figure 4 for An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Abstract:Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.

Via

Access Paper or Ask Questions

Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Jul 02, 2021

Junya Chen, Zhe Gan, Xuan Li, Qing Guo, Liqun Chen, Shuyang Gao, Tagyoung Chung, Yi Xu, Belinda Zeng, Wenlian Lu(+3 more)

Figure 1 for Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Figure 2 for Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Figure 3 for Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Figure 4 for Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Abstract:InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, whereas K is the batch-size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size regime, and present a novel simple, non-trivial contrastive objective named FlatNCE, which fixes this issue. Unlike InfoNCE, our FlatNCE no longer explicitly appeals to a discriminative classification goal for contrastive learning. Theoretically, we show FlatNCE is the mathematical dual formulation of InfoNCE, thus bridging the classical literature on energy modeling; and empirically, we demonstrate that, with minimal modification of code, FlatNCE enables immediate performance boost independent of the subject-matter engineering efforts. The significance of this work is furthered by the powerful generalization of contrastive learning techniques, and the introduction of new tools to monitor and diagnose contrastive training. We substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other datasets, where FlatNCE consistently outperforms InfoNCE.

Via

Access Paper or Ask Questions

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Jun 09, 2021

Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang

Figure 1 for Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Figure 2 for Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Figure 3 for Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Figure 4 for Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Abstract:Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity, by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

Via

Access Paper or Ask Questions

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Jun 08, 2021

Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang(+5 more)

Figure 1 for VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Figure 2 for VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Figure 3 for VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Figure 4 for VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Abstract:Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks, and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study for advanced VidL models. VALUE is available at https://value-leaderboard.github.io/.

* VALUE is available at https://value-leaderboard.github.io/

Via

Access Paper or Ask Questions