Title: A little goes a long way: Improving toxic language classification despite data scarcity

Sep 25, 2020

Mika Juuti, Tommi Gröndahl, Adrian Flanagan, N. Asokan

Abstract: Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation (generating new synthetic data from a labeled seed dataset) can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT, a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.

* Accepted for publication in Findings of ACL: EMNLP 2020
