
"Text": models, code, and papers

Automatic text summarization: What has been done and what has to be done

Apr 01, 2019
Abdelkrime Aries, Djamel eddine Zegour, Walid Khaled Hidouci

Summaries are important when it comes to processing huge amounts of information. Their most important benefit is saving time, a scarce resource nowadays. A summary must therefore be short, representative and readable. Generating summaries automatically can benefit humans by saving time and helping to select relevant documents. Automatic summarization, and in particular automatic text summarization (ATS), is not a new research field; it has been known since the 1950s. Since then, researchers have been working to find the perfect summarization method. In this article, we discuss different works in automatic summarization, especially recent ones. We present some problems and limits which prevent work in this field from moving forward. Most of these challenges are closely related to the nature of the processed languages. These challenges offer academics and developers a path to follow in this field.
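The extractive family of methods this survey covers can be illustrated with a classic frequency-based baseline in the spirit of Luhn's 1950s work: score each sentence by the frequencies of its words and keep the top-scoring ones. This is a minimal sketch of the general idea, not any specific system from the survey.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Frequency-based extractive summarization: score sentences by the
    corpus frequency of their words and keep the top ones in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sent):
        # Average word frequency, so long sentences are not unfairly favored.
        words = re.findall(r'\w+', sent.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore document order
    return ' '.join(sentences[i] for i in keep)
```

Modern ATS methods replace the frequency score with learned sentence representations, but the select-and-concatenate skeleton is the same.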


Chapter Captor: Text Segmentation in Novels

Nov 09, 2020
Charuta Pethe, Allen Kim, Steven Skiena

Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
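The rule-matching half of the paper's hybrid header-recognition approach can be sketched with a regular expression over common Project Gutenberg header forms such as "CHAPTER I" or "Chapter 12". The pattern below is illustrative only; real headers are far more varied, which is why the paper pairs rules with neural inference.

```python
import re

# Matches headers like "CHAPTER I", "Chapter 12", with optional trailing title text.
# Deliberately minimal; the paper's actual rule set is richer.
CHAPTER_RE = re.compile(r'^\s*(?:CHAPTER|Chapter)\s+(?:[IVXLC]+|\d+)\b.*$')

def find_chapter_headers(lines):
    """Return (line_number, text) pairs for lines that look like chapter headers."""
    return [(i, line.strip()) for i, line in enumerate(lines) if CHAPTER_RE.match(line)]
```

Once headers are located, the text between consecutive headers gives the chapter-boundary ground truth used for the segmentation experiments.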

* 11 pages, 10 figures, Accepted at EMNLP 2020 as a long paper 


Cascaded Text Generation with Markov Transformers

Jun 01, 2020
Yuntian Deng, Alexander M. Rush

The two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies. This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output. To parameterize this cascade, we introduce a Markov transformer, a variant of the popular fully autoregressive model that allows us to simultaneously decode with specific autoregressive context cutoffs. This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.


Text Augmentation in a Multi-Task View

Jan 14, 2021
Jason Wei, Chengyu Huang, Shiqi Xu, Soroush Vosoughi

Traditional data augmentation aims to increase the coverage of the input distribution by generating augmented examples that strongly resemble original samples in an online fashion where augmented examples dominate training. In this paper, we propose an alternative perspective -- a multi-task view (MTV) of data augmentation -- in which the primary task trains on original examples and the auxiliary task trains on augmented examples. In MTV data augmentation, both original and augmented samples are weighted substantively during training, relaxing the constraint that augmented examples must resemble original data and thereby allowing us to apply stronger levels of augmentation. In empirical experiments using four common data augmentation techniques on three benchmark text classification datasets, we find that the MTV leads to higher and more robust performance improvements than traditional augmentation.
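The multi-task weighting described above reduces to a simple combined loss: a primary term on the original example and an auxiliary term on the augmented one. This framework-free sketch uses an illustrative auxiliary weight, not the paper's tuned value.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted class probabilities."""
    return -math.log(probs[label])

def mtv_loss(probs_orig, y_orig, probs_aug, y_aug, aug_weight=0.5):
    """Multi-task view of augmentation: the primary task trains on the original
    example, the auxiliary task on the augmented one, both weighted substantively."""
    return cross_entropy(probs_orig, y_orig) + aug_weight * cross_entropy(probs_aug, y_aug)
```

Because the augmented example only contributes through the down-weighted auxiliary term, stronger (less faithful) augmentations can be used without dominating training.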

* Accepted to EACL 2021 


Hybrid Ranking Network for Text-to-SQL

Aug 11, 2020
Qin Lyu, Kaushik Chakrabarti, Shobhit Hathi, Souvik Kundu, Jianwen Zhang, Zheng Chen

In this paper, we study how to leverage pre-trained language models in Text-to-SQL. We argue that previous approaches underutilize the base language models by concatenating all columns together with the NL question and feeding them into the base language model in the encoding stage. We propose a neat approach called Hybrid Ranking Network (HydraNet), which breaks down the problem into column-wise ranking and decoding and finally assembles the column-wise outputs into a SQL query by straightforward rules. In this approach, the encoder is given an NL question and one individual column, which perfectly aligns with the original tasks BERT/RoBERTa are trained on; hence we avoid the ad-hoc pooling or additional encoding layers that are necessary in prior approaches. Experiments on the WikiSQL dataset show that the proposed approach is very effective, achieving the top place on the leaderboard.
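The key encoding change is easy to see in input form: instead of one long sequence holding the question and all columns, each column gets its own sentence-pair input, matching BERT/RoBERTa pre-training. The special-token layout below is a schematic sketch, not HydraNet's exact serialization.

```python
def column_wise_inputs(question, columns):
    """Pair the NL question with each column individually, producing one
    sentence-pair encoder input per column (token layout illustrative)."""
    return [f"[CLS] {question} [SEP] {col} [SEP]" for col in columns]
```

Each pair is encoded independently and scored, and the per-column outputs are then assembled into a SQL query by rules.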


Cross-lingual Distillation for Text Classification

Mar 28, 2018
Ruochen Xu, Yiming Yang

Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, which adapts and extends a framework originally proposed for model compression. Using soft probabilistic predictions for the documents in a label-rich language as the (induced) supervisory labels in a parallel corpus of documents, we train classifiers successfully for new languages in which labeled training data are not available. An adversarial feature adaptation technique is also applied during model training to reduce distribution mismatch. We conducted experiments on two benchmark CLTC datasets, treating English as the source language and German, French, Japanese and Chinese as the unlabeled target languages. The proposed approach achieved performance advantageous over or comparable to other state-of-the-art methods.
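The distillation objective at the heart of this approach is a soft cross-entropy: the source-language teacher's probabilistic predictions on one side of a parallel document pair act as soft labels for the target-language student on the other side. A minimal sketch of that loss:

```python
import math

def distillation_loss(teacher_probs, student_probs):
    """Soft cross-entropy between teacher predictions (used as induced soft
    labels) and the student's predictions on the parallel target-language text."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
```

By Gibbs' inequality this loss is minimized when the student's distribution matches the teacher's, which is what lets the student learn without target-language labels.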

* Accepted at ACL 2017; Code available at 


Multiple-Attribute Text Style Transfer

Nov 01, 2018
Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, Y-Lan Boureau

The dominant approach to unsupervised "style transfer" in text is based on the idea of learning a latent representation, which is independent of the attributes specifying its "style". In this paper, we show that this condition is not necessary and is not always met in practice, even with domain adversarial training that explicitly aims at learning such disentangled representations. We thus propose a new model that controls several factors of variation in textual data where this condition on disentanglement is replaced with a simpler mechanism based on back-translation. Our method allows control over multiple attributes, like gender, sentiment, product type, etc., and a more fine-grained control on the trade-off between content preservation and change of style with a pooling operator in the latent space. Our experiments demonstrate that the fully entangled model produces better generations, even when tested on new and more challenging benchmarks comprising reviews with multiple sentences and multiple attributes.


Analysis of Basic Emotions in Texts Based on BERT Vector Representation

Jan 31, 2021
A. Artemov, A. Veselovskiy, I. Khasenevich, I. Bolokhov

In this paper we present a GAN-type model and the most important stages of its development for the task of emotion recognition in text. In particular, we propose an approach for generating a synthetic dataset of all possible emotion combinations based on manually labelled incomplete data.

* 8 pages, 2 figures, 5 tables 


Corpus Statistics in Text Classification of Online Data

Mar 16, 2018
Marina Sokolova, Victoria Bobicev

Transformation of Machine Learning (ML) from a boutique science to a generally accepted technology has increased the importance of reproducibility and transportability of ML studies. In the current work, we investigate how corpus characteristics of textual data sets correspond to text classification results. We work with two data sets gathered from sub-forums of an online health-related forum. Our empirical results are obtained for a multi-class sentiment analysis application.

* 12 pages, 6 tables, 1 figure 


Character-level Convolutional Networks for Text Classification

Apr 04, 2016
Xiang Zhang, Junbo Zhao, Yann LeCun

This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
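The input representation for a character-level ConvNet is a fixed alphabet with each character one-hot encoded, giving a length-by-alphabet matrix that the convolutions slide over. The sketch below uses a shortened illustrative alphabet (the paper's has 70 characters) and a small maximum length.

```python
import string

# Illustrative alphabet; the paper uses 70 characters including punctuation.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_len=16):
    """One-hot encode a string into a max_len x len(ALPHABET) matrix.
    Out-of-alphabet characters become all-zero rows; longer texts are truncated."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for i, ch in enumerate(text.lower()[:max_len]):
        if ch in CHAR_INDEX:
            matrix[i][CHAR_INDEX[ch]] = 1
    return matrix
```

Because the encoding is purely character-based, the model needs no word segmentation or pre-built vocabulary, which is what makes the large-scale comparisons against word-based models possible.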

* An early version of this work entitled "Text Understanding from Scratch" was posted in Feb 2015 as arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction, Advances in Neural Information Processing Systems 28 (NIPS 2015) 
