Douwe Kiela

Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Oct 06, 2022

Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani, Victor Mustar, Helen Ngo(+9 more)

Abstract:Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub -- a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support reproducibility of evaluation, centralize and document the evaluation process, and broaden evaluation to cover more facets of model performance. It includes over 50 efficient canonical implementations for a variety of domains and scenarios, interactive documentation, and the ability to easily share implementations and outcomes. The library is available at https://github.com/huggingface/evaluate. In addition, we introduce Evaluation on the Hub, a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a button. Evaluation on the Hub is available at https://huggingface.co/autoevaluate.

DataPerf: Benchmarks for Data-Centric AI Development

Jul 20, 2022

Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Douwe Kiela, David Jurado(+26 more)

Abstract:Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused major problems involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, hindering research growth. To solve this problem, we present DataPerf, a benchmark package for evaluating ML datasets and dataset-working algorithms. We intend it to enable the "data ratchet," in which training sets will aid in evaluating test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous loop that will accelerate development of data-centric AI. The MLCommons Association will maintain DataPerf.

Perturbation Augmentation for Fairer NLP

May 25, 2022

Rebecca Qian, Candace Ross, Jude Fernandes, Eric Smith, Douwe Kiela, Adina Williams

Abstract:Unwanted and often harmful social biases are becoming ever more salient in NLP research, affecting both models and datasets. In this work, we ask: does training on demographically perturbed data lead to more fair language models? We collect a large dataset of human annotated text perturbations and train an automatic perturber on it, which we show to outperform heuristic alternatives. We find: (i) Language models (LMs) pre-trained on demographically perturbed corpora are more fair, at least, according to our current best metrics for measuring model fairness, and (ii) LMs finetuned on perturbed GLUE datasets exhibit less demographic bias on downstream tasks. We find that improved fairness does not come at the expense of accuracy. Although our findings appear promising, there are still some limitations, as well as outstanding questions about how best to evaluate the (un)fairness of large language models. We hope that this initial exploration of neural demographic perturbation will help drive more improvement towards fairer NLP.

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Apr 07, 2022

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

Abstract:We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.

* CVPR 2022
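
The matching task in the Winoground abstract above can be illustrated with a small sketch. The abstract only says that two images and two captions must be matched correctly; the three criteria below (text, image, and group) are an assumption about how that might be scored, and the `score` function, the captions, and the image identifiers are hypothetical stand-ins for a real vision and language model and real data.

```python
from typing import Callable, Dict

# Hypothetical scorer type: maps (caption, image_id) to a similarity score.
Scorer = Callable[[str, str], float]


def winoground_scores(score: Scorer, c0: str, i0: str, c1: str, i1: str) -> Dict[str, bool]:
    """Check one example where caption c0 belongs to image i0 and c1 to i1."""
    # Text criterion (assumed): for each image, the correct caption scores higher.
    text_ok = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
    # Image criterion (assumed): for each caption, the correct image scores higher.
    image_ok = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
    # Group criterion (assumed): both of the above must hold.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Toy example: both captions use the same words in a different order.
CAPTION_0 = "a dog chases a cat"
CAPTION_1 = "a cat chases a dog"

# Made-up similarity table standing in for a model's outputs.
TOY_TABLE = {
    (CAPTION_0, "img0"): 0.9,
    (CAPTION_1, "img0"): 0.4,
    (CAPTION_0, "img1"): 0.7,
    (CAPTION_1, "img1"): 0.6,
}


def toy_score(caption: str, image_id: str) -> float:
    return TOY_TABLE[(caption, image_id)]


result = winoground_scores(toy_score, CAPTION_0, "img0", CAPTION_1, "img1")
print(result)
```

In this toy table the scorer picks the right image for each caption but prefers the wrong caption for the second image, so only the image criterion holds. A model that ignores word order would assign both captions the same score and fail every strict comparison.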

Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

Apr 05, 2022

Tristan Thrush, Kushal Tirumala, Anmol Gupta, Max Bartolo, Pedro Rodriguez, Tariq Kane, William Gaviria Rojas, Peter Mattson, Adina Williams, Douwe Kiela

Abstract:We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models, as well as for conducting model in the loop data collection with crowdworkers. Dynatask is integrated with Dynabench, a research platform for rethinking benchmarking in AI that facilitates human and model in the loop data collection and evaluation. To create a task, users only need to write a short task configuration file from which the relevant web interfaces and model hosting infrastructure are automatically generated. The system is available at https://dynabench.org/ and the full library can be found at https://github.com/facebookresearch/dynabench.

* ACL System Demos 2022

Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

Dec 16, 2021

Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, Douwe Kiela

Abstract:In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per example. In this work, we examine if we can maintain the advantages of DADC, without suffering the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits in terms of annotation speed, while leading to improved model fooling rates. In addition, we show that GAA-assisted data leads to higher downstream model performance on a variety of question answering tasks.

FLAVA: A Foundational Language And Vision Alignment Model

Dec 08, 2021

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

Abstract:State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.

* 18 pages

Analyzing Dynamic Adversarial Training Data in the Limit

Oct 16, 2021

Eric Wallace, Adina Williams, Robin Jia, Douwe Kiela

Abstract:To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data. Our analysis shows that DADC yields examples that are more difficult, more lexically and syntactically diverse, and contain fewer annotation artifacts compared to non-adversarial examples.

What's Hidden in a One-layer Randomly Weighted Transformer?

Sep 08, 2021

Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney

Abstract:We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance, without ever modifying the weight initializations, on machine translation tasks. To find subnetworks for one-layer randomly weighted neural networks, we apply different binary masks to the same weight matrix to generate different layers. Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14. Furthermore, we demonstrate the effectiveness of larger and deeper transformers in this setting, as well as the impact of different initialization methods. We released the source code at https://github.com/sIncerass/one_layer_lottery_ticket.

* EMNLP 2021 (short)
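
The idea of applying different binary masks to the same weight matrix to generate different layers, as described in the abstract above, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the matrix size, the keep probability, and the use of random masks are all assumptions made here. The abstract describes finding subnetworks, so in practice the masks would be searched for rather than sampled at random.

```python
import random
from typing import List

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """One frozen, randomly initialized weight matrix (never updated)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def random_mask(rows: int, cols: int, keep_prob: float, rng: random.Random) -> List[List[int]]:
    """Placeholder binary mask; a real method would search for good masks."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(cols)] for _ in range(rows)]


def apply_mask(weights: Matrix, mask: List[List[int]]) -> Matrix:
    """Elementwise product: the mask selects a subnetwork of the shared weights."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


rng = random.Random(0)
shared = random_matrix(4, 4, rng)                         # the single shared weight matrix
masks = [random_mask(4, 4, 0.5, rng) for _ in range(3)]   # one mask per "layer"
layers = [apply_mask(shared, m) for m in masks]           # three layers, one set of weights

for index, layer in enumerate(layers):
    kept = sum(1 for row in masks[index] for bit in row if bit)
    print(f"layer {index}: {kept} of 16 weights kept")
```

Only the masks differ between layers; the underlying weights are stored once and stay at their initial values.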

Human-Adversarial Visual Question Answering

Jun 04, 2021

Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, Douwe Kiela

Abstract:Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model's predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.

* 22 pages, 13 figures. First two authors contributed equally
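
The collection procedure in the abstract above, where a question is kept when the model's predicted answer is incorrect, can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the `predict` function, the example tuples, and the answer normalization are hypothetical, and the actual AdVQA collection and validation pipeline is not described in the abstract.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical example format: (image_id, question, set of acceptable human answers).
Example = Tuple[str, str, Set[str]]


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and strip surrounding whitespace."""
    return answer.strip().lower()


def collect_adversarial(
    examples: List[Example],
    predict: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Keep only the questions on which the model's answer is wrong."""
    kept = []
    for image_id, question, human_answers in examples:
        predicted = normalize(predict(image_id, question))
        accepted = {normalize(a) for a in human_answers}
        if predicted not in accepted:
            kept.append((image_id, question, predicted))
    return kept


def always_yes(image_id: str, question: str) -> str:
    """Toy stand-in for a VQA model that answers "yes" to everything."""
    return "Yes"


candidates = [
    ("img0", "Is there a dog?", {"yes"}),
    ("img0", "How many dogs are there?", {"2", "two"}),
    ("img1", "Is the cup empty?", {"no"}),
]

fooled = collect_adversarial(candidates, always_yes)
print(fooled)
```

With the toy model, the first question is answered correctly and dropped, while the other two are kept as model-fooling examples.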
