Leon Derczynski

SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

Apr 20, 2017

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, Arkaitz Zubiaga

Abstract: Media is full of false claims. Even Oxford Dictionaries named "post-truth" as the word of 2016. This makes it more important than ever to build systems that can identify the veracity of a story, and the kind of discourse there is around it. RumourEval is a SemEval shared task that aims to identify and handle rumours and reactions to them in text. We present an annotation scheme and a large dataset covering multiple topics, each having their own families of claims and replies, and use these to pose two concrete challenges. We also report the results achieved by participants on these challenges.

Generalisation in Named Entity Recognition: A Quantitative Analysis

Mar 07, 2017

Isabelle Augenstein, Leon Derczynski, Kalina Bontcheva

Abstract: Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs play an important role here: they have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.

* Preprint, accepted to Computer Speech and Language
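
The idea of "unseen NEs" in the abstract above can be illustrated with a small sketch. This is not the paper's measurement code; the function name `unseen_entity_rate`, the case-folding step, and the toy entity lists are all assumptions made for illustration.

```python
def unseen_entity_rate(train_entities, test_entities):
    """Fraction of test entity mentions whose surface form never occurs in training.

    Surface forms are compared case-insensitively, which is an assumption
    of this sketch rather than something stated in the abstract.
    """
    seen = {entity.lower() for entity in train_entities}
    if not test_entities:
        return 0.0
    unseen = sum(1 for entity in test_entities if entity.lower() not in seen)
    return unseen / len(test_entities)


# Toy data, invented for illustration only.
train = ["London", "Barack Obama", "Google"]
test = ["London", "Sheffield", "google", "Kalina"]

rate = unseen_entity_rate(train, test)
print(rate)  # 2 of the 4 test mentions are unseen
```

A higher rate on one corpus than on another would suggest that a system relying on memorised surface forms has less to work with there.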

Desiderata for Vector-Space Word Representations

Aug 06, 2016

Leon Derczynski

Abstract: A growing plethora of vector-space representations for words is currently available. These consist of fixed-length vectors of real values, each representing a word. The result is a representation upon which the power of many conventional information processing and data mining techniques can be brought to bear, as long as the representations are designed with some forethought and fit certain constraints. This paper details desiderata for the design of vector space representations of words.

USFD: Twitter NER with Drift Compensation and Linked Data

Nov 10, 2015

Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva

Abstract: This paper describes a pilot NER system for Twitter, comprising the USFD system entry to the W-NUT 2015 NER shared task. The goal is to correctly label entities in a tweet dataset, using an inventory of ten types. We employ structured learning, drawing on gazetteers taken from Linked Data and on unsupervised clustering features, and attempt to compensate for stylistic and topic drift - a key challenge in social media text. Our result is competitive; we provide an analysis of the components of our methodology, and an examination of the target dataset in the context of this task.

* Proceedings of the ACL Workshop on Noisy User-generated Text (2015), pp. 48--53
* Paper in ACL anthology: https://aclweb.org/anthology/W/W15/W15-4306.bib

Analysis of Named Entity Recognition and Linking for Tweets

Oct 27, 2014

Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, Kalina Bontcheva

Abstract: Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

* Information Processing & Management 51 (2), 32-49, 2014
* 35 pages, accepted to journal Information Processing and Management

TempEval-3: Evaluating Events, Time Expressions, and Temporal Relations

May 25, 2014

Naushad UzZaman, Hector Llorens, James Allen, Leon Derczynski, Marc Verhagen, James Pustejovsky

Abstract: We describe the TempEval-3 task, which is currently in preparation for the SemEval-2013 evaluation exercise. The aim of TempEval is to advance research on temporal information processing. TempEval-3 follows on from previous TempEval events, incorporating: a three-part task structure covering event, temporal expression and temporal relation extraction; a larger dataset; and single overall task quality scores.

Clinical TempEval

Mar 19, 2014

Steven Bethard, Leon Derczynski, James Pustejovsky, Marc Verhagen

Abstract: We describe the Clinical TempEval task, which is currently in preparation for the SemEval-2015 evaluation exercise. This task involves identifying and describing events, times and the relations between them in clinical text. Six discrete subtasks are included, focusing on recognising mentions of times and events, describing those mentions for both entity types, identifying the relation between an event and the document creation time, and identifying narrative container relations.

TimeML-strict: clarifying temporal annotation

Apr 26, 2013

Leon Derczynski, Hector Llorens, Naushad UzZaman

Abstract: TimeML is an XML-based schema for annotating temporal information over discourse. The standard has been used to annotate a variety of resources and is followed by a number of tools, the creation of which constitutes hundreds of thousands of man-hours of research work. However, the current state of resources is such that many are not valid, or do not produce valid output, or contain ambiguous or custom additions and removals. Difficulties arising from these variances were highlighted in the TempEval-3 exercise, which included its own extra stipulations over conventional TimeML as a response. To unify the state of current resources, and to make progress toward easy adoption of its current incarnation ISO-TimeML, this paper introduces TimeML-strict: a valid, unambiguous, and easy-to-process subset of TimeML. We also introduce three resources -- a schema for TimeML-strict; a validator tool for TimeML-strict, so that one may ensure documents are in the correct form; and a repair tool that corrects common invalidating errors and adds disambiguating markup in order to convert documents from the laxer TimeML standard to TimeML-strict.

Question Answering Against Very-Large Text Collections

Apr 26, 2013

Leon Derczynski, Richard Shaw, Ben Solway, Jun Wang

Abstract: Question answering involves developing methods to extract useful information from large collections of documents. This is done with specialised search engines such as Answer Finder. The aim of Answer Finder is to provide an answer to a question rather than a page listing related documents that may contain the correct answer. So, a question such as "How tall is the Eiffel Tower" would simply return "325m" or "1,063ft". Our task was to build on the current version of Answer Finder by improving information retrieval, and also improving the pre-processing involved in question series analysis.

* Master's thesis, 2008, University of Sheffield

A Data Driven Approach to Query Expansion in Question Answering

Mar 22, 2012

Leon Derczynski, Jun Wang, Robert Gaizauskas, Mark A. Greenwood

Abstract: Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

* Proc. IR4QA Workshop (2008) 34-41
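
The abstract's point that retrieval bounds QA performance can be made concrete with a coverage-style measure: the fraction of questions for which at least one answer-bearing document appears in the top n results. The sketch below is an assumption-laden illustration, not the paper's evaluation code; the function name `coverage_at_n`, the data layout, and the toy document IDs are all hypothetical.

```python
def coverage_at_n(rankings, answer_docs, n):
    """Fraction of questions with at least one answer-bearing document in the top n.

    rankings:    dict mapping question id -> ranked list of document ids
    answer_docs: dict mapping question id -> set of answer-bearing document ids
    """
    if not rankings:
        return 0.0
    hits = 0
    for question, ranked_docs in rankings.items():
        relevant = answer_docs.get(question, set())
        # A question counts as covered if any of its top-n documents bears an answer.
        if any(doc in relevant for doc in ranked_docs[:n]):
            hits += 1
    return hits / len(rankings)


# Toy data, invented for illustration only.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5"],
    "q3": ["d6"],
}
answer_docs = {"q1": {"d3"}, "q2": {"d9"}, "q3": {"d6"}}

print(coverage_at_n(rankings, answer_docs, 2))  # only q3 is covered within the top 2
print(coverage_at_n(rankings, answer_docs, 3))  # q1 and q3 are covered within the top 3
```

Questions that stay uncovered at every n cannot be answered correctly by any downstream answer-extraction stage working from those results, which is the sense in which retrieval bounds the whole system.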
