Matthew Shardlow

Lexical Complexity Prediction: An Overview

Mar 08, 2023

Kai North, Marcos Zampieri, Matthew Shardlow

Abstract: The occurrence of unknown words in texts significantly hinders reading comprehension. To improve accessibility for specific target populations, computational modelling has been applied to identify complex words in texts and substitute them with simpler alternatives. In this paper, we present an overview of computational approaches to lexical complexity prediction, focusing on the work carried out on English data. We survey relevant approaches to this problem, which include traditional machine learning classifiers (e.g. SVMs, logistic regression) and deep neural networks, as well as a variety of features, such as those inspired by literature in psycholinguistics, word frequency, word length, and many others. Furthermore, we introduce readers to past competitions and available datasets created on this topic. Finally, we include brief sections on applications of lexical complexity prediction, such as readability and text simplification, together with related studies on languages other than English.

* ACM Computing Surveys 55, 9, Article 179 (January 2023), 40 pages
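
As a rough illustration of the feature-based classifiers mentioned in the abstract above, the following minimal sketch trains a tiny logistic regression on two features, word length and word frequency. It is not taken from the survey: the frequency table `TOY_FREQ`, the training labels, the feature scaling, and all function names are hypothetical and chosen only to keep the example self-contained.

```python
import math

# Hypothetical per-million word frequencies (assumption for illustration only;
# a real system would estimate these from a large corpus).
TOY_FREQ = {
    "cat": 120.0, "house": 300.0, "run": 250.0,
    "ubiquitous": 1.5, "obfuscate": 0.3, "perspicacious": 0.05,
}

def features(word):
    """Return [bias, scaled word length, log10 frequency] for one word."""
    freq = TOY_FREQ.get(word.lower(), 0.01)  # small floor for unseen words
    return [1.0, len(word) / 10.0, math.log10(freq)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, word):
    """Estimated probability that the word is complex."""
    z = sum(w * x for w, x in zip(weights, features(word)))
    return sigmoid(z)

def train(examples, epochs=1000, lr=0.1):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for word, label in examples:
            x = features(word)
            error = predict(weights, word) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights

# Hypothetical labels: 1 = complex, 0 = not complex.
TOY_EXAMPLES = [
    ("cat", 0), ("house", 0), ("run", 0),
    ("ubiquitous", 1), ("obfuscate", 1), ("perspicacious", 1),
]

weights = train(TOY_EXAMPLES)
```

Calling `predict(weights, "cat")` and `predict(weights, "perspicacious")` then gives a low and a high complexity probability respectively. The toy data only shows the mechanics and says nothing about how such features perform on real datasets.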

Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

Feb 06, 2023

Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, Matthew Shardlow, Kai North, Marcos Zampieri

Abstract: We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), held in conjunction with EMNLP 2022. The task called on the Natural Language Processing research community to contribute methods to advance the state of the art in multilingual lexical simplification for English, Portuguese, and Spanish. A total of 14 teams submitted the results of their lexical simplification systems for the provided test data. Results of the shared task indicate new benchmarks in lexical simplification, with quantitative results for English noticeably higher than those obtained for Spanish and (Brazilian) Portuguese.

Deanthropomorphising NLP: Can a Language Model Be Conscious?

Nov 21, 2022

Matthew Shardlow, Piotr Przybyła

Abstract: This work is intended as a voice in the discussion over the recent claims that LaMDA, a pretrained language model based on the Transformer architecture, is sentient. This claim, if confirmed, would have serious ramifications in the Natural Language Processing (NLP) community due to the widespread use of similar models. However, here we take the position that such a language model cannot be sentient, or conscious, and that LaMDA in particular exhibits no advances over other similar models that would qualify it. We justify this by analysing the Transformer architecture through Integrated Information Theory. We see the claims of consciousness as part of a wider tendency to use anthropomorphic language in NLP reporting. Regardless of the veracity of the claims, we consider this an opportune moment to take stock of progress in language modelling and consider the ethical implications of the task. In order to make this work helpful for readers outside the NLP community, we also present the necessary background in language modelling.

Lexical Simplification Benchmarks for English, Portuguese, and Spanish

Sep 12, 2022

Sanja Štajner, Daniel Ferrés, Matthew Shardlow, Kai North, Marcos Zampieri, Horacio Saggion

Abstract: Even in highly developed countries, as many as 15-30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while preserving the original meaning. It has attracted considerable attention in the last 20 years, and fully automatic lexical simplification systems have been proposed for various languages. The main obstacle to progress in the field is the absence of high-quality datasets for building and evaluating lexical simplification systems. We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese, and provide details about data selection and annotation procedures. This is the first dataset that offers a direct comparison of lexical simplification systems for three languages. To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with differing architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performance on our new dataset. For a fairer comparison, we use several evaluation measures which capture varied aspects of the systems' efficacy, and discuss their strengths and weaknesses. We find that a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages. More importantly, we find that the state-of-the-art neural lexical simplification systems perform significantly better for English than for Spanish and Portuguese.

Investigating Text Simplification Evaluation

Jul 28, 2021

Laura Vásquez-Rodríguez, Matthew Shardlow, Piotr Przybyła, Sophia Ananiadou

Abstract: Modern text simplification (TS) heavily relies on the availability of gold standard data to build machine learning models. However, existing studies show that parallel TS corpora contain inaccurate simplifications and incorrect alignments. Additionally, evaluation is usually performed by using metrics such as BLEU or SARI to compare system output to the gold standard. A major limitation is that these metrics do not match human judgements, and performance on different datasets and linguistic phenomena varies greatly. Furthermore, our research shows that the test and training subsets of parallel datasets differ significantly. In this work, we investigate existing TS corpora, providing new insights that will motivate the improvement of existing state-of-the-art TS evaluation methods. Our contributions include an analysis of TS corpora based on existing modifications used for simplification and an empirical study of TS model performance using better-distributed datasets. We demonstrate that by improving the distribution of TS datasets, we can build more robust TS models.

* Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 876-882
* 7 pages, 3 figures, 1 table

SemEval-2021 Task 1: Lexical Complexity Prediction

Jun 01, 2021

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, Marcos Zampieri

Abstract: This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

Predicting Lexical Complexity in English Texts

Feb 17, 2021

Matthew Shardlow, Richard Evans, Marcos Zampieri

Abstract: The first step in most text simplification is to predict which words are considered complex for a given target population before carrying out lexical substitution. This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper, we analyze previous work carried out on this task and investigate the properties of complex word identification datasets for English.

Detecting Multiword Expression Type Helps Lexical Complexity Assessment

May 12, 2020

Ekaterina Kochmar, Sian Gooding, Matthew Shardlow

Abstract: Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature. Multiple NLP applications have been shown to benefit from MWE identification; however, research on the lexical complexity of MWEs is still under-explored. In this work, we re-annotate the Complex Word Identification Shared Task 2018 dataset of Yimam et al. (2017), which provides complexity scores for a range of lexemes, with the types of MWEs. We release the MWE-annotated dataset with this paper, and we believe this dataset represents a valuable resource for the text simplification community. In addition, we investigate which types of expressions are most problematic for native and non-native readers. Finally, we show that a lexical complexity assessment system benefits from the information about MWE types.

* Accepted for publication at LREC 2020

CompLex --- A New Corpus for Lexical Complexity Prediction from Likert Scale Data

Mar 16, 2020

Matthew Shardlow, Michael Cooper, Marcos Zampieri

Abstract: Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences, each annotated by around 7 annotators.
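
One plausible way to turn several 5-point Likert ratings into a single continuous complexity value is to rescale each rating to the range 0 to 1 and average across annotators. The sketch below assumes that mapping; the abstract does not state the aggregation used for the corpus, and the function names `likert_to_score` and `aggregate_complexity` are hypothetical.

```python
from statistics import mean

def likert_to_score(rating, points=5):
    """Map a Likert rating in 1..points onto the interval [0, 1]."""
    if not 1 <= rating <= points:
        raise ValueError(f"rating must be between 1 and {points}")
    return (rating - 1) / (points - 1)

def aggregate_complexity(ratings, points=5):
    """Average the rescaled ratings from all annotators of one target word."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(likert_to_score(r, points) for r in ratings)

# Seven hypothetical annotator ratings for one target word.
example_score = aggregate_complexity([1, 2, 2, 3, 1, 2, 3])
```

A continuous target of this kind lets a system be trained as a regressor rather than as the binary classifier used in earlier CWI work.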
