
Nikita Nangia


Does Putting a Linguist in the Loop Improve NLU Data Collection?

Apr 15, 2021

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Sep 30, 2020

Natural Language Understanding with the Quora Question Pairs Dataset

Jul 01, 2019

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

Jun 01, 2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

May 02, 2019

ListOps: A Diagnostic Dataset for Latent Tree Learning

Apr 17, 2018

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Feb 19, 2018

The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations

Jul 25, 2017