"Text": models, code, and papers

Evaluating Text Coherence at Sentence and Paragraph Levels

Jun 05, 2020
Sennan Liu, Shuang Zeng, Sujian Li

In this paper, to evaluate text coherence, we propose a paragraph ordering task alongside the established sentence ordering task. We collected four distinct corpora from different domains, on which we investigate how existing sentence ordering methods adapt to paragraph ordering. We also compare the learnability and robustness of existing models by artificially creating mini datasets and noisy datasets, respectively, and verifying the efficiency of established models under these circumstances. Furthermore, we carry out human evaluation on the rearranged passages from two competitive models and confirm that WLCS-l is a better metric, correlating significantly more strongly with human ratings than tau, the most prevalent metric used before. Results from these evaluations show that, except under certain extreme conditions, the recurrent graph neural network-based model is an optimal choice for coherence modeling.

* Long paper accepted by LREC 2020 
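
For readers unfamiliar with the tau metric mentioned in the abstract above, here is a minimal sketch of Kendall's tau for an ordering task. The function name and the input convention (a list giving, for each predicted position, the gold index of the sentence placed there) are assumptions made for illustration; this is not the authors' evaluation code, and WLCS-l is not reproduced here.

```python
from itertools import combinations


def kendall_tau(predicted_order):
    """Kendall's tau between a predicted ordering and the gold order 0..n-1.

    predicted_order[k] is the gold index of the sentence placed at position k
    (an illustrative convention, not taken from the paper).
    """
    n = len(predicted_order)
    if n < 2:
        return 1.0  # a single item is trivially in order
    # Count pairs that appear in the wrong relative order.
    inversions = sum(
        1
        for i, j in combinations(range(n), 2)
        if predicted_order[i] > predicted_order[j]
    )
    total_pairs = n * (n - 1) // 2
    return 1.0 - 2.0 * inversions / total_pairs
```

A perfectly ordered passage scores 1.0 and a fully reversed one scores -1.0; each swapped pair lowers the score by the same amount regardless of where it occurs.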

Incorporating Priors with Feature Attribution on Text Classification

Jun 19, 2019
Frederick Liu, Besim Avci

Feature attribution methods, proposed recently, help users interpret the predictions of complex models. Our approach integrates feature attributions into the objective function to allow machine learning practitioners to incorporate priors in model building. To demonstrate the effectiveness of our technique, we apply it to two tasks: (1) mitigating unintended bias in text classifiers by neutralizing identity terms; (2) improving classifier performance in a scarce data setting by forcing the model to focus on toxic terms. Our approach adds an L2 distance loss between feature attributions and task-specific prior values to the objective. Our experiments show that (i) a classifier trained with our technique reduces undesired model biases without a trade-off on the original task; (ii) incorporating priors helps model performance in scarce data settings.

* Accepted to ACL 2019 
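
The objective described in the abstract above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `joint_loss`, the weight `lam`, the use of a squared L2 distance, and the convention that `None` marks a token with no prior are all hypothetical, and the attributions are taken as given rather than computed by any particular attribution method.

```python
def joint_loss(task_loss, attributions, priors, lam=1.0):
    """Task loss plus a weighted squared-L2 penalty on attributions.

    attributions: per-token attribution scores (assumed precomputed).
    priors: per-token target values; None means "no prior for this token".
    lam: hypothetical weight balancing the two terms.
    """
    if len(attributions) != len(priors):
        raise ValueError("attributions and priors must have the same length")
    # Only tokens that carry a prior contribute to the penalty.
    prior_loss = sum(
        (a - t) ** 2 for a, t in zip(attributions, priors) if t is not None
    )
    return task_loss + lam * prior_loss
```

Setting the prior of an identity term to 0 pushes its attribution toward neutrality, which matches the first use case in the abstract; a non-zero prior on toxic terms would correspond to the second.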

Facet-Aware Evaluation for Extractive Text Summarization

Aug 27, 2019
Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren, Jiawei Han

Commonly adopted metrics for extractive text summarization, such as ROUGE, focus on lexical similarity and are facet-agnostic. In this paper, we present a facet-aware evaluation procedure for better assessment of the information coverage in extracted summaries while still supporting automatic evaluation once annotated. Specifically, we treat the facet instead of the token as the basic unit for evaluation, manually annotate the support sentences for each facet, and directly evaluate extractive methods by comparing the indices of extracted sentences with support sentences. We demonstrate the benefits of the proposed setup by performing a thorough quantitative investigation on the CNN/Daily Mail dataset, which in the meantime reveals useful insights into state-of-the-art summarization methods.
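
The index-comparison idea in the abstract above can be illustrated with a small sketch. The scoring rule used here (a facet counts as covered when every sentence of at least one of its annotated support groups was extracted) and the name `facet_aware_recall` are assumptions for illustration; the paper's exact scoring may differ.

```python
def facet_aware_recall(extracted, facet_support_groups):
    """Fraction of facets covered by the extracted sentence indices.

    extracted: iterable of sentence indices chosen by the summarizer.
    facet_support_groups: one entry per facet; each entry is a list of
        alternative support groups, each group a list of sentence indices.
    """
    if not facet_support_groups:
        return 0.0
    chosen = set(extracted)
    covered = sum(
        1
        for groups in facet_support_groups
        if any(set(group) <= chosen for group in groups)
    )
    return covered / len(facet_support_groups)
```

Because the comparison works on sentence indices rather than tokens, two extracts with similar wording but different facet coverage receive different scores.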

Question Answering from Unstructured Text by Retrieval and Comprehension

Mar 26, 2017
Yusuke Watanabe, Bhuwan Dhingra, Ruslan Salakhutdinov

Open-domain Question Answering (QA) systems must interact with external knowledge sources, such as web pages, to find relevant information. Information sources like Wikipedia, however, are not well structured and are difficult to use in comparison with Knowledge Bases (KBs). In this work we present a two-step approach to question answering from unstructured text, consisting of a retrieval step and a comprehension step. For comprehension, we present an RNN-based attention model with a novel mixture mechanism for selecting answers from either retrieved articles or a fixed vocabulary. For retrieval, we introduce a hand-crafted model and a neural model for ranking relevant articles. We achieve state-of-the-art performance on the WikiMovies dataset, reducing the error by 40%. Our experimental results further demonstrate the importance of each of the introduced components.

Comparison of Czech Transformers on Text Classification Tasks

Jul 21, 2021
Jan Lehečka, Jan Švec

In this paper, we present our progress in pre-training monolingual Transformers for Czech and contribute to the research community by releasing our models to the public. The need for such models emerged from our effort to employ Transformers in our language-specific tasks, where we found the performance of the published multilingual models to be very limited. Since the multilingual models are usually pre-trained on 100+ languages, most low-resource languages (including Czech) are under-represented in these models. At the same time, there is a huge amount of monolingual training data available in web archives like Common Crawl. We have pre-trained and publicly released two monolingual Czech Transformers and compared them with relevant public models trained (at least partially) for Czech. The paper presents the Transformer pre-training procedure as well as a comparison of pre-trained models on text classification tasks from various domains.

Transliterating Kurdish texts in Latin into Persian-Arabic script

Oct 24, 2021
Hossein Hassani

Kurdish is written in different scripts. The two most popular scripts are Latin and Persian-Arabic. However, not all Kurdish readers are familiar with both scripts, a problem that automatic transliterators could resolve. So far, the developed tools mostly transliterate Persian-Arabic script into Latin. We present a transliterator that converts Kurdish texts in Latin into Persian-Arabic script. We also discuss the issues that should be considered in the transliteration process. The tool is part of Kurdish BLARK, and it is publicly available for non-commercial use.

* 4 pages 

Centroid estimation based on symmetric KL divergence for Multinomial text classification problem

Oct 24, 2018
Jiangning Chen, Heinrich Matzinger, Haoyan Zhai, Mi Zhou

We define a new method to estimate the centroid for text classification based on the symmetric KL-divergence between the distribution of words in training documents and their class centroids. Experiments on several standard data sets indicate that the new method achieves substantial improvements over traditional classifiers.
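
As background for the abstract above, the symmetric KL divergence between two word distributions and a nearest-centroid decision based on it can be sketched as follows. The function names, the `eps` smoothing constant, and the example class labels are assumptions for illustration; the paper's actual centroid estimation procedure is not reproduced here, and the centroids are taken as given.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p), which simplifies to sum((p - q) * log(p / q)).

    eps is a small constant (an assumption) that avoids log(0) for zero entries.
    """
    return sum(
        (pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)
    )


def nearest_centroid(doc_dist, centroids):
    """Return the label whose centroid is closest to doc_dist in symmetric KL."""
    return min(centroids, key=lambda label: symmetric_kl(doc_dist, centroids[label]))
```

For example, with hypothetical centroids `{"sports": [0.7, 0.2, 0.1], "finance": [0.1, 0.2, 0.7]}`, a document with word distribution `[0.6, 0.3, 0.1]` is assigned to `"sports"`.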

Text-based inference of moral sentiment change

Jan 20, 2020
Jing Yi Xie, Renato Ferreira Pinto Jr., Graeme Hirst, Yang Xu

We present a text-based framework for investigating moral sentiment change of the public via longitudinal corpora. Our framework is based on the premise that language use can inform people's moral perception toward right or wrong, and we build our methodology by exploring moral biases learned from diachronic word embeddings. We demonstrate how a parameter-free model supports inference of historical shifts in moral sentiment toward concepts such as slavery and democracy over centuries at three incremental levels: moral relevance, moral polarity, and fine-grained moral dimensions. We apply this methodology to visualizing moral time courses of individual concepts and analyzing the relations between psycholinguistic variables and rates of moral sentiment change at scale. Our work offers opportunities for applying natural language processing toward characterizing moral sentiment change in society.

* In Proceedings of EMNLP 2019 

Handwritten and Printed Text Separation in Real Document

Mar 19, 2013
Abdel Belaïd, K. C. Santosh, Vincent Poulain D'Andecy

The aim of the paper is to separate handwritten and printed text in real documents that contain noise and graphics, including annotations. Relying on the run-length smoothing algorithm (RLSA), the extracted pseudo-lines and pseudo-words are used as basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel performs a first labelling of each pseudo-word, taking its local neighbourhood into account. It then propagates the context between neighbours so that possible labelling errors can be corrected. To address running-time complexity, we propose linear-complexity methods that use k-NN with a constraint. When using a kd-tree, the running time is almost linearly proportional to the number of pseudo-words. The performance of our system is close to 90%, even with a very small learning dataset whose samples are mostly complex administrative documents.

* Machine Vision Applications (2013) 
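
The run-length smoothing step named in the abstract above can be illustrated on a single binary pixel row. This is a minimal sketch, not the authors' implementation: the function name, the 0/1 pixel convention, and the choice to fill only background runs bounded by foreground on both sides are assumptions, and implementations differ on how they treat runs at the image border.

```python
def rlsa_row(row, threshold):
    """Horizontal run-length smoothing of one binary row (1 = ink, 0 = background).

    Background runs of length <= threshold that lie between two ink pixels
    are filled with ink, which merges nearby characters into word-like blobs.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for k in range(last_ink + 1, i):
                        out[k] = 1
            last_ink = i
    return out
```

With a threshold of 2, the row `[1, 0, 0, 1, 0, 0, 0, 0, 1, 0]` becomes `[1, 1, 1, 1, 0, 0, 0, 0, 1, 0]`: the short gap is bridged while the longer one is kept as a separator.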

Step-unrolled Denoising Autoencoders for Text Generation

Dec 13, 2021
Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, Aaron van den Oord

In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.