"Text": models, code, and papers

A Comparison of Two Fluctuation Analyses for Natural Language Clustering Phenomena: Taylor and Ebeling & Neiman Methods

Sep 14, 2020
Kumiko Tanaka-Ishii, Shuntaro Takahashi

This article considers the fluctuation analysis methods of Taylor and Ebeling & Neiman. While both have been applied to various phenomena in the statistical mechanics domain, their similarities and differences have not been clarified. After considering their analytical aspects, this article presents a large-scale application of these methods to text. It is found that both methods can distinguish real text from independently and identically distributed (i.i.d.) sequences. Furthermore, it is found that the Taylor exponents acquired from words can roughly distinguish text categories; this is also the case for Ebeling and Neiman exponents, but to a lesser extent. Additionally, both methods show some possibility of capturing script kinds.
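As a rough illustration of what a Taylor-type fluctuation analysis measures, the sketch below splits a token sequence into fixed-size windows, records the mean and standard deviation of each word's per-window count, and fits the slope of log(std) against log(mean). The function names, the windowing scheme, and the plain least-squares fit are assumptions made for illustration and are not taken from the paper, whose exact procedure may differ.

```python
import math
from collections import Counter


def taylor_points(tokens, window):
    """Map each word to (mean, std) of its counts over fixed-size windows.

    Hypothetical helper: trailing tokens that do not fill a whole window
    are ignored.
    """
    n_windows = len(tokens) // window
    counts = [Counter(tokens[i * window:(i + 1) * window])
              for i in range(n_windows)]
    points = {}
    for word in set(tokens[:n_windows * window]):
        xs = [c[word] for c in counts]
        mean = sum(xs) / n_windows
        var = sum((x - mean) ** 2 for x in xs) / n_windows
        points[word] = (mean, math.sqrt(var))
    return points


def fit_exponent(points):
    """Least-squares slope of log(std) versus log(mean).

    Points with zero mean or zero std are skipped because their
    logarithm is undefined.
    """
    logs = [(math.log(m), math.log(s)) for m, s in points if m > 0 and s > 0]
    n = len(logs)
    mx = sum(x for x, _ in logs) / n
    my = sum(y for _, y in logs) / n
    sxx = sum((x - mx) ** 2 for x, _ in logs)
    sxy = sum((x - mx) * (y - my) for x, y in logs)
    return sxy / sxx
```

A call such as `fit_exponent(taylor_points(tokens, 1000).values())` would give one exponent per text. For counts that behave like independent draws, the standard deviation grows roughly as the square root of the mean, so an exponent near 0.5 is the i.i.d. baseline and larger values indicate clustering.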

* Fractals, 2021, No. 2


NUAA-QMUL at SemEval-2020 Task 8: Utilizing BERT and DenseNet for Internet Meme Emotion Analysis

Nov 09, 2020
Xiaoyu Guo, Jing Ma, Arkaitz Zubiaga

This paper describes our contribution to SemEval 2020 Task 8: Memotion Analysis. Our system learns multi-modal embeddings from text and images in order to classify Internet memes by sentiment. Our model learns text embeddings using BERT and extracts features from images with DenseNet, subsequently combining both features through concatenation. We also compare our results with those produced by DenseNet, ResNet, BERT, and BERT-ResNet. Our results show that image classification models have the potential to help classify memes, with DenseNet outperforming ResNet. Adding text features is, however, not always helpful for Memotion Analysis.
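Fusion by concatenation can be pictured with a minimal sketch: a text feature vector and an image feature vector are joined end to end, and a classifier head scores the joined vector. The function names and the plain linear head are hypothetical, and the short toy vectors stand in for real BERT and DenseNet outputs; none of this is the authors' code.

```python
def fuse(text_vec, image_vec):
    """Concatenate a text feature vector and an image feature vector."""
    return list(text_vec) + list(image_vec)


def linear_scores(features, weights, biases):
    """Score each class as a weighted sum of the fused features plus a bias.

    `weights` holds one row per class; this linear head is an assumed
    stand-in for whatever classifier sits on top of the fused features.
    """
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


# Toy example: 2 text features, 1 image feature, 2 classes.
fused = fuse([1, 2], [3])
scores = linear_scores(fused, [[1, 0, 0], [0, 1, 1]], [0, 1])
```

The fused vector's length is the sum of the two input lengths, so the classifier head has to be sized for both modalities together.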


HyperText: Endowing FastText with Hyperbolic Geometry

Oct 30, 2020
Yudong Zhu, Di Zhou, Jinghui Xiao, Xin Jiang, Xiao Chen, Qun Liu

Natural language data exhibit tree-like hierarchical structures such as the hypernym-hyponym relations in WordNet. FastText, as the state-of-the-art text classifier based on a shallow neural network in Euclidean space, may not model such hierarchies precisely with its limited representation capacity. Considering that hyperbolic space is naturally suitable for modeling tree-like hierarchical data, we propose a new model named HyperText for efficient text classification by endowing FastText with hyperbolic geometry. Empirically, we show that HyperText outperforms FastText on a range of text classification tasks with far fewer parameters.

* Findings of EMNLP 2020 


Generating Knowledge Graph Paths from Textual Definitions using Sequence-to-Sequence Models

Apr 05, 2019
Victor Prokhorov, Mohammad Taher Pilehvar, Nigel Collier

We present a novel method for mapping unrestricted text to knowledge graph entities by framing the task as a sequence-to-sequence problem. Specifically, given the encoded state of an input text, our decoder directly predicts paths in the knowledge graph, starting from the root and ending at the target node following hypernym-hyponym relationships. In this way, and in contrast to other text-to-entity mapping systems, our model outputs hierarchically structured predictions that are fully interpretable in the context of the underlying ontology, in an end-to-end manner. We present a proof-of-concept experiment with encouraging results, comparable to those of state-of-the-art systems.

* Accepted at NAACL 2019


Diseño de un espacio semántico sobre la base de la Wikipedia. Una propuesta de análisis de la semántica latente para el idioma español

Jan 28, 2019
Dalina Aidee Villa, Igor Barahona, Luis Javier Álvarez

Latent Semantic Analysis (LSA) was initially conceived within cognitive psychology in the 1990s. Since its emergence, LSA has been used to model cognitive processes, point out academic texts, compare literary works and analyse political speeches, among other applications. Taking a multivariate method for dimensionality reduction as its starting point, this paper proposes a semantic space for the Spanish language. Our results include a document-text matrix with dimensions 1.3 x 10^6 by 5.9 x 10^6, which is then decomposed into singular values. Those singular values are used for the semantic analysis of words or text.
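The decomposition step can be sketched on a toy matrix. The code below recovers only the largest singular value and its singular vectors by power iteration, using nothing beyond the Python standard library. The function name and the choice of power iteration are assumptions for illustration; a matrix of the size reported above would in practice call for a sparse, truncated SVD routine from a numerical library.

```python
import math


def top_singular_triplet(a, iters=200):
    """Approximate the largest singular value of `a` and its vectors.

    `a` is a list of rows. Power iteration is applied to A^T A, starting
    from a uniform vector; it assumes that start is not orthogonal to the
    leading right singular vector and that `a` is not all zeros.
    """
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # u = A v, then w = A^T u = (A^T A) v
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Toy matrix with one dominant direction.
sigma, u, v = top_singular_triplet([[3.0, 0.0], [0.0, 1.0]])
```

In an LSA setting, rows and columns would correspond to documents and terms, and keeping only the leading singular directions yields the reduced semantic space in which words or texts are compared.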

* 14 pages, in Spanish, 4 figures 


Integer-Programming Ensemble of Temporal-Relations Classifiers

Jul 30, 2018
Catherine Kerr, Terri Hoare, Paula Carroll, Jakub Marecek

The extraction and understanding of temporal events and their relations are major challenges in natural language processing. Processing text on a sentence-by-sentence or expression-by-expression basis often fails, in part due to the challenge of capturing the global consistency of the text. We present an ensemble method, which reconciles the outputs of multiple classifiers of temporal expressions across the text using integer programming. Computational experiments show that the ensemble improves upon the best individual results from two recent challenges, SemEval-2013 TempEval-3 (Temporal Annotation) and SemEval-2016 Task 12 (Clinical TempEval).


NeMo Toolbox for Speech Dataset Construction

Apr 11, 2021
Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

In this paper, we introduce a new toolbox for constructing speech datasets from long audio recordings and raw reference texts. We develop tools for each step of the speech dataset construction pipeline, including data preprocessing, audio-text alignment, data post-processing and filtering. The proposed pipeline also supports human-in-the-loop review to address text-audio mismatch issues and remove samples that don't satisfy the quality requirements. We demonstrate the toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks. The toolbox is open-sourced in the NeMo framework. The RuLS corpus is released on OpenSLR.


Digital Peter: Dataset, Competition and Handwriting Recognition Methods

Mar 16, 2021
Mark Potanin, Denis Dimitrov, Alex Shonenkov, Vladimir Bataev, Denis Karachev, Maxim Novopoltsev

This paper presents a new dataset of Peter the Great's manuscripts and describes a segmentation procedure that converts initial images of documents into lines. The new dataset may be useful for researchers to train handwritten text recognition models and as a benchmark for comparing different models. It consists of 9 694 images and text files corresponding to lines in historical documents. The open machine learning competition Digital Peter was held based on this dataset. The baseline solution for this competition, as well as more advanced methods for handwritten text recognition, are described in the article. The full dataset and all code are publicly available.

* 17 pages, 7 figures, submitted to ICDAR 2021 


Principal Components of the Meaning

Sep 18, 2020
Neslihan Suzen, Alexander Gorban, Jeremy Levesley, Evgeny Mirkes

In this paper we argue that (lexical) meaning in science can be represented in a 13-dimensional Meaning Space. This space is constructed using principal component analysis (singular decomposition) on the matrix of word category relative information gains, where the categories are those used by the Web of Science, and the words are taken from a reduced word set from texts in the Web of Science. We show that this reduced word set plausibly represents all texts in the corpus, so that the principal component analysis has some objective meaning with respect to the corpus. We argue that 13 dimensions are adequate to describe the meaning of scientific texts, and hypothesise about the qualitative meaning of the principal components.
