"Text": models, code, and papers

Legal Search in Case Law and Statute Law

Aug 23, 2021
Julien Rossi, Evangelos Kanoulas

In this work we describe a method to identify document pairwise relevance in the context of a typical legal document collection: limited resources, long queries and long documents. We review the usage of generalized language models, including supervised and unsupervised learning. We observe how our method, while using text summaries, outperforms existing baselines based on full text, and motivate potential improvement directions for future work.

Tag Recommendation by Word-Level Tag Sequence Modeling

Nov 30, 2019
Xuewen Shi, Heyan Huang, Shuyang Zhao, Ping Jian, Yi-Kun Tang

In this paper, we transform tag recommendation into a word-based text generation problem and introduce a sequence-to-sequence model. The model inherits the advantages of an LSTM-based encoder for sequential modeling and an attention-based decoder with local positional encodings for learning relations globally. Experimental results on Zhihu datasets show that the proposed model outperforms state-of-the-art methods based on text classification.

* This is a full length version of the paper in DASFAA 2019 

Verifying Heaps' law using Google Books Ngram data

Dec 29, 2016
Vladimir V. Bochkarev, Eduard Yu. Lerner, Anna V. Shevlyakova

This article is devoted to the verification of the empirical Heaps' law in European languages using Google Books Ngram corpus data. The connection between word distribution frequency and expected dependence of individual word number on text size is analysed in terms of a simple probability model of text generation. It is shown that the Heaps exponent varies significantly within characteristic time intervals of 60-100 years.

* 8 pages, 6 figures 
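
Heaps' law is usually written as V(n) = K * n^beta, where V(n) is the number of distinct words among the first n tokens of a text. The sketch below shows one common way to estimate the exponent beta, using a least-squares line fit in log-log space. It is an illustration only and is not taken from the paper: the function names `vocabulary_growth` and `fit_heaps` are hypothetical, and it assumes a plain token stream rather than the yearly Google Books Ngram counts the authors work with.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: number of distinct tokens among the first n."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def fit_heaps(growth):
    """Fit log V = log K + beta * log n by ordinary least squares.

    `growth` is a sequence of (n, V) pairs with at least two distinct
    values of n. Returns the pair (K, beta).
    """
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

To track how the exponent changes over time, one could apply `fit_heaps` separately to the growth curve of each time window, assuming per-window token streams are available.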

Système de traduction automatique statistique Anglais-Arabe (English-Arabic Statistical Machine Translation System)

Feb 06, 2018
Marwa Hadj Salah, Didier Schwab, Hervé Blanchon, Mounir Zrigui

Machine translation (MT) is the process of translating text written in a source language into text in a target language. In this article, we present our English-Arabic statistical machine translation system. First, we present the general process for setting up a statistical machine translation system, then we describe the tools as well as the different corpora we used to build our MT system. Our system was evaluated in terms of the BLEU score (24.51%).

* in French 

MRA - Proof of Concept of a Multilingual Report Annotator Web Application

Apr 12, 2017
Luís Campos, Francisco Couto

MRA (Multilingual Report Annotator) is a web application that translates Radiology text and annotates it with RadLex terms. Its goal is to explore translating non-English Radiology reports as a way to address the problem that most Text Mining tools are developed for English. In this brief paper we explain the language barrier problem and briefly describe the application. MRA can be found at .

Natural language processing: she needs something old and something new (maybe something borrowed and something blue, too)

Dec 21, 1995
Karen Sparck Jones

Given the present state of work in natural language processing, this address argues first, that advance in both science and applications requires a revival of concern about what language is about, broadly speaking the world; and second, that an attack on the summarising task, which is made ever more important by the growth of electronic text resources and requires an understanding of the role of large-scale discourse structure in marking important text content, is a good way forward.

* Presidential Address, 1994, Association for Computational Linguistics 

Extending DRT with a Focusing Mechanism for Pronominal Anaphora and Ellipsis Resolution

Nov 09, 1994
Jose Abracos, Jose Gabriel Lopes

Cormack (1992) proposed a framework for pronominal anaphora resolution. Her proposal integrates focusing theory (Sidner et al.) and DRT (Kamp and Reyle). We analyzed this methodology and adjusted it to the processing of Portuguese texts. The scope of the framework was widened to cover sentences containing restrictive relative clauses and subject ellipsis. Tests were conceived and applied to probe the adequacy of proposed modifications when dealing with processing of current texts.

Sequence-Based Extractive Summarisation for Scientific Articles

Apr 07, 2022
Daniel Kershaw, Rob Koeling

This paper presents the results of research on supervised extractive text summarisation for scientific articles. We show that a simple sequential tagging model based only on the text within a document achieves high results against a simple classification model. Improvements can be achieved through additional sentence-level features, though these were minimal. Through further analysis, we show the potential of the sequential model relying on the structure of the document depending on the academic discipline which the document is from.

* 7 pages 

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

Aug 27, 2020
Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected. We describe the ULI dataset and how it was constructed using the Wanca 2017 corpus and texts in different languages from the Leipzig corpora collection. We also provide baseline language identification experiments conducted using the ULI 2020 dataset.

Multilingual Cross-domain Perspectives on Online Hate Speech

Sep 11, 2018
Tom De Smedt, Sylvia Jaki, Eduan Kotzé, Leïla Saoud, Maja Gwóźdź, Guy De Pauw, Walter Daelemans

In this report, we present a study of eight corpora of online hate speech, by demonstrating the NLP techniques that we used to collect and analyze the jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we have focused on text classification, text profiling, keyword and collocation extraction, along with manual annotation and qualitative study.

* CLiPS Technical Report Series 8 (2018) 1-24 
* 24 pages 

