Yinfei Yang

A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

Oct 06, 2022

Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

Abstract: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards intelligent agents or robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pre-training on large text and image-text datasets from the web has been extensively explored but the improvements are limited. To address the scarcity of in-domain instruction data, we investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360° panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky (Wang et al., 2022), a high-quality multilingual navigation instruction generator. To further increase the variability of the trajectories, we also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a transformer agent with imitation learning for over 700M steps of experience. On the challenging Room-across-Room dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
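
The abstract reports results in NDTW without defining it. As a rough illustration, the sketch below computes normalized dynamic time warping between a reference path and an agent path, using the commonly cited form exp(-DTW(R, Q) / (|R| * d_th)). The function names, the 2D coordinates, and the 3.0 m threshold are assumptions made for this example, not details taken from the paper.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths of (x, y) points."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = best alignment cost of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; `threshold` (metres) is an assumed value."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


reference_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
agent_path = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5), (2.5, 1.0)]
score = ndtw(reference_path, agent_path)
```

A path that matches the reference exactly scores 1.0, and the score decays towards 0 as the agent strays further from the reference; the paper reports the metric scaled to a 0-100 range.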

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jun 22, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan(+7 more)

Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.

* Preprint

Simple and Effective Synthesis of Indoor 3D Scenes

Apr 06, 2022

Jing Yu Koh, Harsh Agrawal, Dhruv Batra, Richard Tucker, Austin Waters, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

Abstract: We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially perturbed by our model improves success rate by up to 1.5% over a state-of-the-art baseline on the R2R benchmark. Our code will be made available to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks.

LongT5: Efficient Text-To-Text Transformer for Long Sequences

Dec 15, 2021

Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang

Abstract: Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.

* preprint

Large Dual Encoders Are Generalizable Retrievers

Dec 15, 2021

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang(+1 more)

Abstract: It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), significantly outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021). Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to achieve the best out-of-domain performance. All the GTR models are released at https://tfhub.dev/google/collections/gtr/1.

MURAL: Multimodal, Multitask Retrieval Across Languages

Sep 10, 2021

Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge

Abstract: Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.

A Simple and Effective Method To Eliminate the Self Language Bias in Multilingual Representations

Sep 10, 2021

Ziyi Yang, Yinfei Yang, Daniel Cer, Eric Darve

Abstract: Language-agnostic representation and the isolation of semantic from language information are an emerging research direction for multilingual representation models. We explore this problem from a novel angle of geometric algebra and semantic space. A simple but highly effective method, "Language Information Removal (LIR)", factors out language identity information from semantically related components in multilingual representations pre-trained on multi-monolingual data. A post-training and model-agnostic method, LIR only uses simple linear operations, e.g. matrix factorization and orthogonal projection. LIR reveals that for weak-alignment multilingual systems, the principal components of semantic spaces primarily encode language identity information. We first evaluate LIR on a cross-lingual question answer retrieval task (LAReQA), which requires strong alignment of the multilingual embedding space. Experiments show that LIR is highly effective on this task, yielding almost 100% relative improvement in MAP for weak-alignment models. We then evaluate LIR on the Amazon Reviews and XEVAL datasets, with the observation that removing language information improves cross-lingual transfer performance.
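
The two linear operations named in the abstract, matrix factorization and orthogonal projection, can be illustrated with a small sketch. It is not the authors' implementation: it estimates a single dominant direction of one language's embeddings by power iteration (standing in for a full SVD) and projects that direction out. The function names and the toy 2D embeddings are assumptions for illustration only.

```python
import math


def top_direction(rows, iters=200):
    """Dominant right singular vector of `rows`, estimated by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # One step of v <- X^T (X v), followed by renormalization.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def remove_direction(vec, direction):
    """Orthogonal projection: subtract the component of `vec` along `direction`."""
    dot = sum(a * b for a, b in zip(vec, direction))
    return [a - dot * b for a, b in zip(vec, direction)]


# Toy embeddings for one language: a large shared offset on axis 0
# (playing the role of language identity) plus small variation on axis 1.
embeddings = [[3.0, 0.1], [2.9, -0.1], [3.1, 0.0]]
language_dir = top_direction(embeddings)
cleaned = [remove_direction(e, language_dir) for e in embeddings]
```

After the projection, each cleaned vector has no component along the estimated direction, so only the residual variation remains. Removing more than one direction would repeat the projection for each additional singular vector.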

* Accepted to the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

Aug 26, 2021

Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, Yinfei Yang

Abstract: We provide the first exploration of sentence embeddings from text-to-text transformers (T5). Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods for extracting T5 sentence embeddings: two utilize only the T5 encoder and one uses the full T5 encoder-decoder model. Our encoder-only models outperform BERT-based sentence embeddings on both transfer tasks and semantic textual similarity (STS). Our encoder-decoder method achieves further improvement on STS. Scaling up T5 from millions to billions of parameters is found to produce consistent improvements on downstream tasks. Finally, we introduce a two-stage contrastive learning approach that achieves a new state-of-the-art on STS using sentence embeddings, outperforming both Sentence BERT and SimCSE.

Pathdreamer: A World Model for Indoor Navigation

May 18, 2021

Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

Abstract: People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations (RGB, semantic segmentation and depth) for viewpoints that have not been visited, in buildings not seen during training. In regions of high uncertainty (e.g. predicting around corners, imagining the contents of an unseen room), Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes for a given trajectory. We demonstrate that Pathdreamer encodes useful and accessible visual, spatial and semantic knowledge about human environments by using it in the downstream task of Vision-and-Language Navigation (VLN). Specifically, we show that planning ahead with Pathdreamer brings about half the benefit of looking ahead at actual observations from unobserved parts of the environment. We hope that Pathdreamer will help unlock model-based approaches to challenging embodied navigation tasks such as navigating to specified objects and VLN.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Feb 11, 2021

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

Abstract: Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
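
To make the dual-encoder contrastive objective concrete, here is a minimal sketch of an in-batch softmax contrastive loss over paired image and text embeddings, summing the image-to-text and text-to-image directions. It assumes the embeddings have already been produced by some encoders; the function names, the toy vectors, and the temperature value are assumptions for illustration, not values from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax_diagonal_loss(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Image-to-text plus text-to-image loss over one batch of pairs."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    return softmax_diagonal_loss(logits) + softmax_diagonal_loss(transposed)


image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(image_batch, [[2.0, 0.0], [0.0, 3.0]])
shuffled_loss = contrastive_loss(image_batch, [[0.0, 3.0], [2.0, 0.0]])
```

Matched pairs sit on the diagonal of the similarity matrix and every other item in the batch serves as a negative, so the aligned batch yields a loss near zero while the shuffled batch is penalized heavily.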
