Donald Metzler

SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption

Jun 29, 2021

Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler

Abstract: Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data. However, such methods are domain-specific and little has been done to leverage this technique on real-world tabular datasets. We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled. We show that SCARF complements existing strategies and outperforms alternatives like autoencoders. We conduct comprehensive ablations, detailing the importance of a range of factors.
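The view construction mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `corrupt_row`, the `corruption_rate` default, and the choice to draw each replacement value from the same column of a randomly chosen row are assumptions, since the abstract only states that a random subset of features is corrupted.

```python
import random


def corrupt_row(row, table, corruption_rate=0.6, rng=random):
    """Return (view, corrupted_indices) for one tabular example.

    A random subset of feature positions is chosen, and each chosen
    position is overwritten with the value that a randomly picked row
    of `table` holds in the same column (an assumed corruption scheme).
    """
    n_features = len(row)
    n_corrupt = int(corruption_rate * n_features)
    corrupt_idx = rng.sample(range(n_features), n_corrupt)

    view = list(row)  # copy so the original row is left untouched
    for j in corrupt_idx:
        donor = rng.choice(table)
        view[j] = donor[j]
    return view, sorted(corrupt_idx)


# Hypothetical toy table: 4 rows, 5 numeric features.
table = [
    [0.1, 5.0, 1.0, 7.0, 3.0],
    [0.2, 6.0, 2.0, 8.0, 4.0],
    [0.3, 7.0, 3.0, 9.0, 5.0],
    [0.4, 8.0, 4.0, 10.0, 6.0],
]
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
view, corrupted = corrupt_row(table[0], table, corruption_rate=0.6, rng=rng)
```

In a contrastive setup, the original row and its corrupted view would typically be encoded and treated as a positive pair; that training loop is outside the scope of this sketch.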

How Reliable are Model Diagnostics?

May 12, 2021

Vamsi Aribandi, Yi Tay, Donald Metzler

Abstract: In the pursuit of a deeper understanding of a model's behaviour, there is recent impetus for developing suites of probes aimed at diagnosing models beyond simple metrics like accuracy or BLEU. This paper takes a step back and asks an important and timely question: how reliable are these diagnostics in providing insight into models and training setups? We critically examine three recent diagnostic tests for pre-trained language models, and find that likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed. Based on our empirical findings, we also formulate recommendations for practitioners and researchers.

* ACL 2021 Findings

Are Pre-trained Convolutions Better than Pre-trained Transformers?

May 07, 2021

Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler

Abstract: In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive with Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.

* Accepted to ACL 2021

Rethinking Search: Making Experts out of Dilettantes

May 05, 2021

Donald Metzler, Yi Tay, Dara Bahri, Marc Najork

Abstract: When experiencing an information need, users want to engage with an expert, but often turn to an information retrieval system, such as a search engine, instead. Classical information retrieval systems do not answer information needs directly, but instead provide references to (hopefully authoritative) answers. Successful question answering systems offer a limited corpus created on-demand by human experts, which is neither timely nor scalable. Large pre-trained language models, by contrast, are capable of directly generating prose that may be responsive to an information need, but at present they are dilettantes rather than experts: they do not have a true understanding of the world, they are prone to hallucinating, and crucially they are incapable of justifying their utterances by referring to supporting documents in the corpus they were trained over. This paper examines how ideas from classical information retrieval and large pre-trained language models can be synthesized and evolved into systems that truly deliver on the promise of expert advice.

OmniNet: Omnidirectional Representations from Transformers

Mar 01, 2021

Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

Abstract: This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computational cost of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al.), low-rank attention (Wang et al.) and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA), and Image Recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including achieving state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representation in Vision Transformers leads to significant improvements on image recognition tasks on both few-shot learning and fine-tuning setups.

Label Smoothed Embedding Hypothesis for Out-of-Distribution Detection

Feb 09, 2021

Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler

Abstract: Detecting out-of-distribution (OOD) examples is critical in many applications. We propose an unsupervised method to detect OOD samples using a $k$-NN density estimate with respect to a classification model's intermediate activations on in-distribution samples. We leverage a recent insight about label smoothing, which we call the Label Smoothed Embedding Hypothesis, and show that one of the implications is that the $k$-NN density estimator performs better as an OOD detection method both theoretically and empirically when the model is trained with label smoothing. Finally, we show that our proposal outperforms many OOD baselines and also provide new finite-sample high-probability statistical results for $k$-NN density estimation's ability to detect OOD examples.
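The general idea of scoring OOD-ness with a $k$-NN density estimate can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's procedure: the function name `knn_ood_score`, the use of Euclidean distance, and the use of the distance to the $k$-th nearest in-distribution embedding as an inverse proxy for density are all illustrative choices, and the toy 2-D vectors stand in for a model's intermediate activations.

```python
import math


def knn_ood_score(query, reference, k=3):
    """Distance from `query` to its k-th nearest neighbour in `reference`.

    A larger distance corresponds to a lower k-NN density estimate,
    so higher scores suggest the query is more likely out-of-distribution.
    """
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference points")
    distances = sorted(math.dist(query, r) for r in reference)
    return distances[k - 1]


# Hypothetical in-distribution embeddings clustered near the origin.
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

near_score = knn_ood_score([0.05, 0.05], reference, k=2)
far_score = knn_ood_score([5.0, 5.0], reference, k=2)
```

A threshold on this score, chosen on held-out in-distribution data, would then separate inliers from suspected OOD inputs.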

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

Dec 15, 2020

Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville

Abstract: There are two major classes of natural language grammars: the dependency grammar, which models one-to-one correspondences between words, and the constituency grammar, which models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on only inducing one class of grammars, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and dependency graph. Then we integrate the induced dependency relations into the transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

Long Range Arena: A Benchmark for Efficient Transformers

Nov 08, 2020

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler

Abstract: Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
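The quadratic cost mentioned at the start of the abstract comes from the pairwise score matrix that vanilla self-attention builds: every token is compared with every other token. The sketch below is only an illustration of that counting argument; the helper names are hypothetical, the dot-product scores omit scaling and softmax, and reading $1K$ and $16K$ as 1,000 and 16,000 tokens is an assumption.

```python
def attention_scores(queries, keys):
    """Unnormalized dot-product scores: one entry per (query, key) pair."""
    return [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]


def num_score_entries(seq_len):
    """Number of pairwise scores for a sequence of `seq_len` tokens."""
    return seq_len * seq_len


# Three hypothetical 2-dimensional token vectors give a 3 x 3 score matrix.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens, tokens)

# Growing the sequence 16x (1,000 -> 16,000 tokens) grows the matrix 256x.
growth = num_score_entries(16_000) / num_score_entries(1_000)
```

Efficient Transformer variants aim to avoid materializing this full matrix, for example through sparsity, low-rank projections, or kernel approximations.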

Surprise: Result List Truncation via Extreme Value Theory

Oct 19, 2020

Dara Bahri, Che Zheng, Yi Tay, Donald Metzler, Andrew Tomkins

Abstract: Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user. The problem of result list truncation, or where to truncate the ranked list of results, however, has received less attention despite being crucial in a variety of applications. Such truncation is a balancing act between the overall relevance, or usefulness of the results, and the user cost of processing more results. Result list truncation can be challenging because relevance scores are often not well-calibrated. This is particularly true in large-scale IR systems where documents and queries are embedded in the same metric space and a query's nearest document neighbors are returned during inference. Here, relevance is inversely proportional to the distance between the query and candidate document, but what distance constitutes relevance varies from query to query and changes dynamically as more documents are added to the index. In this work, we propose Surprise scoring, a statistical method that leverages the Generalized Pareto distribution that arises in extreme value theory to produce interpretable and calibrated relevance scores at query time using nothing more than the ranked scores. We demonstrate its effectiveness on the result list truncation task across image, text, and IR datasets and compare it to both classical and recent baselines. We draw connections to hypothesis testing and $p$-values.
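For readers unfamiliar with the Generalized Pareto distribution, the sketch below evaluates its survival function, which gives the probability that an excess over a threshold is at least as large as the one observed (a $p$-value-like quantity). It does not reproduce the Surprise scoring method itself: the function name `gpd_survival` is hypothetical, and the shape and scale parameters are assumed to have been fitted elsewhere.

```python
import math


def gpd_survival(excess, shape, scale):
    """P(Y > excess) for a Generalized Pareto distribution.

    `excess` is the amount by which a score exceeds the chosen threshold,
    `shape` is the tail index (xi) and `scale` (sigma) must be positive.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    if excess <= 0:
        return 1.0
    if abs(shape) < 1e-12:
        # The shape -> 0 limit is the exponential distribution.
        return math.exp(-excess / scale)
    base = 1.0 + shape * excess / scale
    if base <= 0:
        # With negative shape the support is bounded; beyond it the tail is empty.
        return 0.0
    return base ** (-1.0 / shape)


# Hypothetical fitted parameters; a small tail probability marks a
# score as unusually extreme relative to the fitted tail.
tail_prob_small_excess = gpd_survival(0.5, shape=0.5, scale=1.0)
tail_prob_large_excess = gpd_survival(2.0, shape=0.5, scale=1.0)
```

Larger excesses yield smaller tail probabilities, which is what makes such a quantity usable as a calibrated measure of how extreme a score is.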

Efficient Transformers: A Survey

Sep 16, 2020

Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

Abstract: Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of "X-former" models have been proposed (Reformer, Linformer, Performer, Longformer, to name a few) which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency. With the aim of helping the avid researcher navigate this flurry, this paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models, providing an organized and comprehensive overview of existing work and models across multiple domains.
