In this work we describe a method for identifying pairwise document relevance in the context of a typical legal document collection: limited resources, long queries, and long documents. We review the use of generalized language models, covering both supervised and unsupervised learning. We show that our method, although it operates on text summaries, outperforms existing baselines based on full text, and we motivate promising directions for future work.
In this paper, we cast tag recommendation as a word-based text generation problem and introduce a sequence-to-sequence model. The model combines the advantages of an LSTM-based encoder for sequential modeling with an attention-based decoder that uses local positional encodings to learn relations globally. Experimental results on Zhihu datasets show that the proposed model outperforms other state-of-the-art methods based on text classification.
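To make the architecture concrete, here is a minimal sketch of an LSTM encoder paired with an attention-based decoder. This is not the authors' implementation: all names, dimensions, and the toy data are illustrative assumptions, and the local positional encodings mentioned in the abstract are omitted for brevity.

```python
# Minimal sketch (not the authors' code): an LSTM encoder with an
# attention-based decoder for tag-sequence generation.
import torch
import torch.nn as nn

class Seq2SeqTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(emb_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)   # simple additive-style scoring
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, src, tgt):
        enc_out, (h, c) = self.encoder(self.embed(src))   # enc_out: (B, T, H)
        h, c = h.squeeze(0), c.squeeze(0)
        logits = []
        for t in range(tgt.size(1)):                      # teacher forcing
            h, c = self.decoder(self.embed(tgt[:, t]), (h, c))
            # attention over all encoder states
            scores = self.attn(torch.cat(
                [enc_out, h.unsqueeze(1).expand_as(enc_out)], dim=-1))
            weights = torch.softmax(scores, dim=1)        # (B, T, 1)
            context = (weights * enc_out).sum(dim=1)      # (B, H)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                 # (B, T_out, V)

model = Seq2SeqTagger(vocab_size=10000)
src = torch.randint(0, 10000, (4, 20))   # toy post token ids
tgt = torch.randint(0, 10000, (4, 5))    # toy tag-word token ids
print(model(src, tgt).shape)             # torch.Size([4, 5, 10000])
```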
This article is devoted to verifying the empirical Heaps law in European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probabilistic model of text generation. It is shown that the Heaps exponent varies significantly over characteristic time intervals of 60-100 years.
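For reference, Heaps' law relates the number of distinct words V to the total text size n through a power law, where K and the Heaps exponent beta are corpus-dependent constants:

```latex
% Heaps' law: vocabulary size V as a function of text size n,
% with corpus-dependent constants K and \beta (the Heaps exponent).
V(n) = K \, n^{\beta}, \qquad 0 < \beta < 1
```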
Machine translation (MT) is the process of translating text written in a source language into text in a target language. In this article, we present our English-Arabic statistical machine translation system. First, we present the general process for setting up a statistical machine translation system; then we describe the tools and the different corpora we used to build our MT system. Our system was evaluated in terms of the BLEU score (24.51%).
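As an illustration of the evaluation metric (not the authors' actual pipeline), a sentence-level BLEU score can be computed with NLTK; the toy sentences below are made up.

```python
# Illustrative only: sentence-level BLEU with NLTK. Smoothing is used
# because short sentences often have no matching higher-order n-grams.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "committee", "approved", "the", "proposal"]]
hypothesis = ["the", "committee", "accepted", "the", "proposal"]

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```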
MRA (Multilingual Report Annotator) is a web application that translates Radiology text and annotates it with RadLex terms. Its goal is to explore translation of non-English Radiology reports as a way to address the fact that most Text Mining tools are developed for English. In this brief paper we explain the language barrier problem and briefly describe the application. MRA can be found at https://github.com/lasigeBioTM/MRA .
Given the present state of work in natural language processing, this address argues, first, that advances in both science and applications require a revival of concern with what language is about (broadly speaking, the world); and second, that attacking the summarising task, which is made ever more important by the growth of electronic text resources and which requires an understanding of the role of large-scale discourse structure in marking important text content, is a good way forward.
Cormack (1992) proposed a framework for pronominal anaphora resolution. Her proposal integrates focusing theory (Sidner et al.) and DRT (Kamp and Reyle). We analysed this methodology and adapted it to the processing of Portuguese texts. The scope of the framework was widened to cover sentences containing restrictive relative clauses and subject ellipsis. Tests were designed and applied to probe the adequacy of the proposed modifications when processing real texts.
This paper presents the results of research on supervised extractive text summarisation for scientific articles. We show that a simple sequential tagging model based only on the text within a document performs well against a simple classification model. Improvements can be achieved through additional sentence-level features, though these gains were minimal. Through further analysis, we show that the sequential model exploits the structure of the document, with the benefit depending on the academic discipline the document comes from.
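A minimal sketch of what such a sequential tagging model might look like, under the assumption that each sentence is first encoded to a fixed-size vector: a recurrent layer reads the sentence sequence and a per-sentence binary head predicts whether to include the sentence in the summary. This is an illustration, not the paper's model, and all dimensions are invented.

```python
# Hedged sketch: sequential sentence tagging for extractive summarisation.
import torch
import torch.nn as nn

class SentenceTagger(nn.Module):
    def __init__(self, sent_dim=384, hidden_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(sent_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(hidden_dim * 2, 2)  # keep / drop per sentence

    def forward(self, sent_embeddings):           # (B, n_sents, sent_dim)
        states, _ = self.rnn(sent_embeddings)
        return self.head(states)                  # (B, n_sents, 2)

doc = torch.randn(1, 12, 384)       # one toy document of 12 sentence vectors
print(SentenceTagger()(doc).shape)  # torch.Size([1, 12, 2])
```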
This article introduces the Wanca 2017 corpus of texts crawled from the internet, from which the sentences in rare Uralic languages used in the Uralic Language Identification (ULI) 2020 shared task were collected. We describe the ULI dataset and how it was constructed from the Wanca 2017 corpus together with texts in other languages from the Leipzig corpora collection. We also report baseline language identification experiments conducted on the ULI 2020 dataset.
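As a hedged illustration of a language identification baseline (not necessarily the one used in the shared task), a character n-gram classifier can be built with scikit-learn; the training sentences and labels below are toy stand-ins.

```python
# Sketch of a character n-gram language identification baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = ["tämä on suomea", "see on eesti keel", "this is english"]
train_labels = ["fin", "est", "eng"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
clf.fit(train_sentences, train_labels)
print(clf.predict(["kas see on eesti keel"]))  # e.g. ['est']
```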
In this report, we present a study of eight corpora of online hate speech, demonstrating the NLP techniques we used to collect and analyze jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we focused on text classification, text profiling, and keyword and collocation extraction, along with manual annotation and qualitative study.
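As one concrete illustration of the techniques listed above, collocation extraction can be sketched with NLTK's bigram collocation finder; this is illustrative, not the report's actual pipeline, and the tokens are a toy stand-in for a real corpus.

```python
# Illustrative sketch: bigram collocation extraction with NLTK.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("they spread hate speech online and hate speech spreads fast "
          "while hate speech is reported").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams seen at least twice
print(finder.nbest(BigramAssocMeasures().pmi, 3))  # top bigrams by PMI
```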