Leon Derczynski

Training a T5 Using Lab-sized Resources

Aug 25, 2022

Manuel R. Ciosici, Leon Derczynski

Abstract: Training large neural language models on large datasets is resource- and time-intensive. These requirements create a barrier to entry, where those with fewer resources cannot build competitive models. This paper presents various techniques for making it possible to (a) train a large language model using resources that a modest research lab might have, and (b) train it in a reasonable amount of time. We provide concrete recommendations for practitioners, which we illustrate with a case study: a T5 model for Danish, the first for this language.

Sparse Probability of Agreement

Aug 12, 2022

Jeppe Nørregaard, Leon Derczynski

Abstract: Measuring inter-annotator agreement is important for annotation tasks, but many metrics require a fully-annotated dataset (or subset), where all annotators annotate all samples. We define Sparse Probability of Agreement, SPA, which estimates the probability of agreement when not all annotator-item pairs are available. We show that SPA, with some assumptions, is an unbiased estimator and provide several weighting schemes for handling samples with different numbers of annotations, evaluated over a range of datasets.
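
The idea of estimating agreement from incomplete annotations can be illustrated with a minimal sketch. It assumes that agreement on one item is the fraction of annotation pairs carrying the same label, and that the overall estimate is a weighted mean over items with at least two annotations. The function names and the two weighting schemes are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter


def item_agreement(labels):
    """Fraction of ordered annotation pairs on one item that share a label."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two annotations to measure agreement")
    counts = Counter(labels)
    agreeing = sum(c * (c - 1) for c in counts.values())
    return agreeing / (n * (n - 1))


def sparse_agreement(items, weight=lambda n: 1.0):
    """Weighted mean of per-item agreement; items with < 2 labels are skipped.

    `weight` maps an item's annotation count to its weight (an assumption
    made for this sketch, not a scheme confirmed by the source).
    """
    num = 0.0
    den = 0.0
    for labels in items:
        n = len(labels)
        if n < 2:
            continue
        w = weight(n)
        num += w * item_agreement(labels)
        den += w
    if den == 0:
        raise ValueError("no item has two or more annotations")
    return num / den


# Made-up data: each inner list holds the labels one item received.
items = [["a", "a"], ["a", "b", "a"], ["b"]]
flat = sparse_agreement(items)
by_pairs = sparse_agreement(items, weight=lambda n: n * (n - 1) / 2)
```

With flat weights every multiply-annotated item counts equally; weighting by the number of pairs gives more influence to items with more annotations.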

The ITU Faroese Pairs Dataset

Jun 17, 2022

Leon Derczynski, Annika Solveig Hedegaard Isfeldt, Signhild Djurhuus

Abstract: This article documents a dataset of sentence pairs between Faroese and Danish, produced at ITU Copenhagen. The data covers translation from both source languages, and is intended for use as training data for machine translation systems in this language pair.

Set Interdependence Transformer: Set-to-Sequence Neural Networks for Permutation Learning and Structure Prediction

Jun 08, 2022

Mateusz Jurewicz, Leon Derczynski

Abstract: The task of learning to map an input set onto a permuted sequence of its elements is challenging for neural networks. Set-to-sequence problems occur in natural language processing, computer vision and structure prediction, where interactions between elements of large sets define the optimal output. Models must exhibit relational reasoning, handle varying cardinalities and manage combinatorial complexity. Previous attention-based methods require $n$ layers of their set transformations to explicitly represent $n$-th order relations. Our aim is to enhance their ability to efficiently model higher-order interactions through an additional interdependence component. We propose a novel neural set encoding method called the Set Interdependence Transformer, capable of relating the set's permutation invariant representation to its elements within sets of any cardinality. We combine it with a permutation learning module into a complete, 3-part set-to-sequence model and demonstrate its state-of-the-art performance on a number of tasks. These range from combinatorial optimization problems, through permutation learning challenges on both synthetic and established NLP datasets for sentence ordering, to a novel domain of product catalog structure prediction. Additionally, the network's ability to generalize to unseen sequence lengths is investigated and a comparative empirical analysis of the existing methods' ability to learn higher-order interactions is provided.

* Paper accepted for publication in the IJCAI-ECAI 2022 proceedings: https://www.ijcai.org/proceedings/

Bridging the Domain Gap for Stance Detection for the Zulu language

May 06, 2022

Gcinizwe Dlamini, Imad Eddine Ibrahim Bekkouch, Adil Khan, Leon Derczynski

Abstract: Misinformation has become a major concern in recent years given its spread across our information sources. In the past years, many NLP tasks have been introduced in this area, with some systems reaching good results on English language datasets. Existing AI-based approaches for fighting misinformation in the literature suggest automatic stance detection as an integral first step to success. Our paper aims at utilizing this progress made for English to transfer that knowledge into other languages, which is a non-trivial task due to the domain gap between English and the target languages. We propose a black-box non-intrusive method that utilizes techniques from Domain Adaptation to reduce the domain gap, without requiring any human expertise in the target language, by leveraging low-quality data in both a supervised and unsupervised manner. This allows us to rapidly achieve similar results for stance detection for the Zulu language, the target language in this work, as are found for English. We also provide a stance detection dataset in the Zulu language. Our experimental results show that by leveraging English datasets and machine translation we can improve performance on English data as well as on other languages.

* Accepted to IntelliSys

Handling and Presenting Harmful Text

Apr 29, 2022

Leon Derczynski, Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen

Abstract: Textual data can pose a risk of serious harm. These harms can be categorised along three axes: (1) the harm type (e.g. misinformation, hate speech or racial stereotypes); (2) whether it is elicited as a feature of the research design from directly studying harmful content (e.g. training a hate speech classifier or auditing unfiltered large-scale datasets) versus spuriously invoked from working on unrelated problems (e.g. language generation or part of speech tagging) but with datasets that nonetheless contain harmful content; and (3) who it affects, from the humans (mis)represented in the data to those handling or labelling the data to readers and reviewers of publications produced from the data. How textual harms should be handled, presented, and discussed is an unsolved problem in NLP, but stopping work on content which poses a risk of harm is untenable. Accordingly, we provide practical advice and introduce HarmCheck, a resource for reflecting on research into textual harms. We hope our work encourages ethical, responsible, and respectful research in the NLP community.

Detecting Abusive Albanian

Jul 30, 2021

Erida Nurce, Jorgel Keci, Leon Derczynski

Abstract: The ever-growing usage of social media in recent years has had a direct impact on the increased presence of hate speech and offensive speech on online platforms. Research on effective detection of such content has mainly focused on English and a few other widespread languages, while the remaining majority have not received the same attention and thus cannot benefit from the steady advancements made in the field. In this paper we present Shaj, an annotated Albanian dataset for hate speech and offensive speech that has been constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval. The dataset is tested using three different classification models, the best of which achieves an F1 score of 0.77 for the identification of offensive language, an F1 score of 0.64 for the automatic categorization of offensive types, and an F1 score of 0.52 for offensive language target identification.

Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

Apr 16, 2021

Magnus Jacobsen, Mikkel H. Sørensen, Leon Derczynski

Abstract: Improvements in machine learning-based NLP performance are often presented alongside bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools; bigger models tend to require more resources during training and inference. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study over part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although the deep models have the highest performance for multiple scores, it is often not the most complex of these that reaches peak performance.
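
A size-performance skyline can be read as the set of models that no other model beats on both size and score. The following is a minimal sketch of that general idea, assuming smaller size and higher score are both better. The function name, model names, and numbers are hypothetical and are not results from the paper.

```python
def skyline(models):
    """Return names of models not dominated on (size, score).

    `models` is a list of (name, size, score) tuples. A model is kept if its
    score is strictly higher than that of every model that is no larger.
    Exact duplicates in both size and score are reported only once.
    """
    front = []
    best_score = float("-inf")
    # Visit models from smallest to largest; break size ties by higher score.
    for name, size, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            front.append(name)
            best_score = score
    return front


# Made-up taggers: (name, size in MB, accuracy).
taggers = [
    ("a", 10, 0.90),
    ("b", 50, 0.95),
    ("c", 60, 0.93),
    ("d", 5, 0.80),
]
optimal = skyline(taggers)
```

Here tagger "c" is left out because "b" is both smaller and more accurate, while the small tagger "d" stays on the skyline despite its lower score.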

Discriminating Between Similar Nordic Languages

Dec 11, 2020

René Haas, Leon Derczynski

Abstract: Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely, we focus on discrimination between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.

Power Consumption Variation over Activation Functions

Jun 12, 2020

Leon Derczynski

Abstract: The power that machine learning models consume when making predictions can be affected by a model's architecture. This paper presents various estimates of power consumption for a range of different activation functions, a core factor in neural network model architecture design. Substantial differences in hardware performance exist between activation functions. This difference informs how power consumption in machine learning models can be reduced.
