Get our free extension to see links to code for papers anywhere online!

 Add to Chrome

 Add to Firefox

CatalyzeX Code Finder - Browser extension linking code for ML papers across the web! | Product Hunt Embed

User Generated Data: Achilles' heel of BERT

Mar 29, 2020
Ankit Kumar, Piyush Makhija, Anuj Gupta



Pre-trained language models such as BERT are known to perform exceedingly well on various NLP tasks and have even established new State-Of-The-Art (SOTA) benchmarks for many of these tasks. Owing to its success on various tasks and benchmark datasets, industry practitioners have started to explore BERT to build applications solving industry use cases. These use cases are known to have much more noise in the data as compared to benchmark datasets. In this work we systematically show that when the data is noisy, there is a significant degradation in the performance of BERT. Specifically, we performed experiments using BERT on popular tasks such sentiment analysis and textual similarity. For this we work with three well known datasets - IMDB movie reviews, SST-2 and STS-B to measure the performance. Further, we examine the reason behind this performance drop and identify the shortcomings in the BERT pipeline.

* 7 pages, 2 figures, 6 plots 


Share this with someone who'll enjoy it:

   Access Paper Source



Share this with someone who'll enjoy it: