Jason Phang

QuALITY: Question Answering with Long Input Texts, Yes!

Dec 16, 2021

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He (+1 more)

Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).

Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

Nov 16, 2021

Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman

Abstract: More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans. This has left little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this work, we study the impact of applying each approach to create more challenging evaluation datasets. We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models. We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used. However, the resulting ranking of models can also be unstable and highly sensitive to the choice of adversary model used. Moreover, AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the most contentiously labeled examples. Smaller-scale experiments on the adversarially collected datasets ANLI and AdversarialQA show similar findings, broadly lowering performance with stronger adversaries while disproportionately affecting the adversary model.
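
The first approach named in this abstract, filtering out easy examples, can be illustrated with a deliberately simplified sketch. This is not the AFLite algorithm, whose details the abstract does not give. It assumes a hypothetical table recording whether each adversary model answers each example correctly, and it drops the examples that most adversaries get right. The function name, the `threshold` parameter, and the toy data are all illustrative assumptions.

```python
def filter_easy_examples(correct_by_model, threshold=0.75):
    """Return indices of examples kept after dropping 'easy' ones.

    correct_by_model[m][i] is True when hypothetical adversary model m
    answers example i correctly. An example counts as easy when the
    fraction of adversaries answering it correctly is at least
    `threshold` (an assumed cutoff, not a value from the paper).
    """
    n_models = len(correct_by_model)
    n_examples = len(correct_by_model[0])
    kept = []
    for i in range(n_examples):
        solved = sum(1 for m in range(n_models) if correct_by_model[m][i])
        if solved / n_models < threshold:
            kept.append(i)
    return kept


# Toy correctness table: 4 adversary models (rows) x 4 examples (columns).
toy = [
    [True, True, False, False],
    [True, False, False, True],
    [True, True, False, False],
    [True, True, True, False],
]
kept = filter_easy_examples(toy)
print(kept)  # indices of the examples that survive filtering
```

Because the kept set depends entirely on which adversaries fill the table, swapping in a different adversary changes which examples survive, which is one intuition for the ranking instability the abstract reports.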

BBQ: A Hand-Built Bias Benchmark for Question Answering

Oct 15, 2021

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman

Abstract: It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two distinct levels: (i) given an under-informative context, test how strongly model answers reflect social biases, and (ii) given an adequately informative context, test whether the model's biases still override a correct answer choice. We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting. Though models are much more accurate when the context provides an unambiguous answer, they still rely on stereotyped information and achieve an accuracy 2.5 percentage points higher on examples where the correct answer aligns with a social bias, with this accuracy difference widening to 5 points for examples targeting gender.

* 16 pages, 9 figures

Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

Sep 20, 2021

Jason Phang, Haokun Liu, Samuel R. Bowman

Abstract: Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance, even with no further tuning.

* BlackboxNLP 2021
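
For readers unfamiliar with centered kernel alignment, the following is a minimal sketch of the linear variant of CKA, assuming NumPy is available. The abstract does not say which CKA variant or kernel the authors use, so this should be read as an illustration of the general technique, not as the paper's implementation. The function name and the random stand-in "activations" are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x: (n_examples, d1) activations from one layer.
    y: (n_examples, d2) activations from another layer, same examples.
    Returns a similarity in [0, 1].
    """
    # Center each feature (column) across examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Random stand-ins for layer activations (not real model outputs).
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(50, 8))
layer_b = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))

same = linear_cka(layer_a, layer_a)
rotated = linear_cka(layer_a, layer_a @ rotation)
unrelated = linear_cka(layer_a, layer_b)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, so `same` and `rotated` both come out at 1 up to floating-point error, while independent random matrices score lower. Computing this for every pair of layers gives the kind of layer-by-layer similarity matrix in which a block diagonal structure could be observed.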

Comparing Test Sets with Item Response Theory

Jun 01, 2021

Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R. Bowman

Abstract: Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.

* ACL 2021
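
As background on Item Response Theory, the sketch below shows a standard three-parameter logistic (3PL) item response function. The abstract does not state which IRT parameterization the paper fits, so the choice of 3PL, the function name, and the parameter values here are assumptions for illustration only.

```python
import math


def p_correct(ability, discrimination, difficulty, guessing=0.0):
    """Probability that a responder (here, a model) answers an item correctly.

    ability:        latent skill of the responder (theta).
    discrimination: how sharply the item separates weak from strong responders.
    difficulty:     ability level at which the logistic term equals 0.5.
    guessing:       lower asymptote, e.g. 0.25 for 4-way multiple choice.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# A highly discriminating item separates two responders more than a flat one.
weak, strong = -1.0, 1.0
gap_sharp = p_correct(strong, 2.0, 0.0) - p_correct(weak, 2.0, 0.0)
gap_flat = p_correct(strong, 0.5, 0.0) - p_correct(weak, 0.5, 0.0)
```

Under this model, a test example is useful for telling strong models apart when its difficulty sits near their ability level and its discrimination is high; an example that every strong model already answers correctly contributes little, which is one way to read "saturated".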

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Dec 31, 2020

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima (+2 more)

Abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Investigating and Simplifying Masking-based Saliency Methods for Model Interpretability

Oct 19, 2020

Jason Phang, Jungkyu Park, Krzysztof J. Geras

Abstract: Saliency maps that identify the most informative regions of an image for a classifier are valuable for model interpretability. A common approach to creating saliency maps involves generating input masks that mask out portions of an image to maximally deteriorate classification performance, or mask in an image to preserve classification performance. Many variants of this approach have been proposed in the literature, such as counterfactual generation and optimizing over a Gumbel-Softmax distribution. Using a general formulation of masking-based saliency methods, we conduct an extensive evaluation study of a number of recently proposed variants to understand which elements of these methods meaningfully improve performance. Surprisingly, we find that a well-tuned, relatively simple formulation of a masking-based saliency model outperforms many more complex approaches. We find that the most important ingredients for high quality saliency map generation are (1) using both masked-in and masked-out objectives and (2) training the classifier alongside the masking model. Strikingly, we show that a masking model can be trained with as few as 10 examples per class and still generate saliency maps with only a 0.7-point increase in localization error.

Reducing false-positive biopsies with deep neural networks that utilize local and global information in screening mammograms

Sep 19, 2020

Nan Wu, Zhe Huang, Yiqiu Shen, Jungkyu Park, Jason Phang, Taro Makino, S. Gene Kim, Kyunghyun Cho, Laura Heacock, Linda Moy (+1 more)

Abstract: Breast cancer is the most common cancer in women, and hundreds of thousands of unnecessary biopsies are done around the world at a tremendous cost. It is crucial to reduce the rate of biopsies that turn out to be benign tissue. In this study, we build deep neural networks (DNNs) to classify biopsied lesions as being either malignant or benign, with the goal of using these networks as second readers serving radiologists to further reduce the number of false positive findings. We enhance the performance of DNNs that are trained to learn from small image patches by integrating global context provided in the form of saliency maps learned from the entire image into their reasoning, similar to how radiologists consider global context when evaluating areas of interest. Our experiments are conducted on a dataset of 229,426 screening mammography exams from 141,473 patients. We achieve an AUC of 0.8 on a test set consisting of 464 benign and 136 malignant lesions.

English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too

May 26, 2020

Jason Phang, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, Iacer Calixto, Samuel R. Bowman

Abstract: Intermediate-task training has been shown to substantially improve pretrained model performance on many language understanding tasks, at least in monolingual English settings. Here, we investigate whether English intermediate-task training is still helpful on non-English target tasks in a zero-shot cross-lingual setting. Using a set of 7 intermediate language understanding tasks, we evaluate intermediate-task transfer in a zero-shot cross-lingual setting on 9 target tasks from the XTREME benchmark. Intermediate-task training yields large improvements on the BUCC and Tatoeba tasks that use model representations directly without training, and moderate improvements on question-answering target tasks. Using SQuAD for intermediate training achieves the best results across target tasks, with an average improvement of 8.4 points on development sets. Selecting the best intermediate task model for each target task, we obtain a 6.1 point improvement over XLM-R Large on the XTREME benchmark, setting a new state of the art. Finally, we show that neither multi-task intermediate-task training nor continuing multilingual MLM during intermediate-task training offers significant improvements.

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

May 09, 2020

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman

Abstract: While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.

* ACL 2020
