Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox


Collecting Entailment Data for Pretraining: New Protocols and Negative Results

Add code

Apr 24, 2020
Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, Emily Pitler


Share this with someone who'll enjoy it:


Textual entailment (or NLI) data has proven useful as pretraining data for tasks requiring language understanding, even when building on an already-pretrained model like RoBERTa. The standard protocol for collecting NLI was not designed for the creation of pretraining data, and it is likely far from ideal for this purpose. With this application in mind, we propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a simple MNLI-based baseline, we collect and compare five new 8.5k-example training sets. Our primary results are solidly negative, with our baseline MNLI-style dataset yielding good transfer performance, but none of our four new methods (nor the recent ANLI) showing any improvements on that baseline. However, we do observe that all four of these interventions, especially the use of seed sentences for inspiration, reduce previously observed issues with annotation artifacts.



   Access Paper Source



Share this with someone who'll enjoy it: