Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets

Nov 11, 2021
Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann

NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio - a new evaluation set for WikiBio - composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.

* 10 pages, 2 figures, accepted to NeurIPS 2021 Datasets and Benchmarks Track 

  Access Paper or Ask Questions

Have convolutions already made recurrence obsolete for unconstrained handwritten text recognition ?

Dec 09, 2020
Denis Coquenet, Yann Soullard, Clément Chatelain, Thierry Paquet

Unconstrained handwritten text recognition remains an important challenge for deep neural networks. These last years, recurrent networks and more specifically Long Short-Term Memory networks have achieved state-of-the-art performance in this field. Nevertheless, they are made of a large number of trainable parameters and training recurrent neural networks does not support parallelism. This has a direct influence on the training time of such architectures, with also a direct consequence on the time required to explore various architectures. Recently, recurrence-free architectures such as Fully Convolutional Networks with gated mechanisms have been proposed as one possible alternative achieving competitive results. In this paper, we explore convolutional architectures and compare them to a CNN+BLSTM baseline. We propose an experimental study regarding different architectures on an offline handwriting recognition task using the RIMES dataset, and a modified version of it that consists of augmenting the images with notebook backgrounds that are printed grids.

* 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) 

  Access Paper or Ask Questions

Label-Wise Document Pre-Training for Multi-Label Text Classification

Aug 15, 2020
Han Liu, Caixia Yuan, Xiaojie Wang

A major challenge of multi-label text classification (MLTC) is to stimulatingly exploit possible label differences and label correlations. In this paper, we tackle this challenge by developing Label-Wise Pre-Training (LW-PT) method to get a document representation with label-aware information. The basic idea is that, a multi-label document can be represented as a combination of multiple label-wise representations, and that, correlated labels always cooccur in the same or similar documents. LW-PT implements this idea by constructing label-wise document classification tasks and trains label-wise document encoders. Finally, the pre-trained label-wise encoder is fine-tuned with the downstream MLTC task. Extensive experimental results validate that the proposed method has significant advantages over the previous state-of-the-art models and is able to discover reasonable label relationship. The code is released to facilitate other researchers.

* Accepted to NLPCC 2020 

  Access Paper or Ask Questions

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

May 30, 2020
Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, John P. McCrae

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

  Access Paper or Ask Questions

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

Mar 04, 2020
Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, Jing Xiao

Targeting at both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer which generates mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor.Instead of adopting the attention mechanism in Transformer TTS to align text to mel-spectrum, the alignment loss is presented to consider all possible alignments in training by use of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance which outperforms Transformer TTS by 0.03 in mean option score (MOS), but also a high efficiency which is more than 50 times faster than real-time.

* will be presented in ICASSP 2020 

  Access Paper or Ask Questions

A Cross-Sentence Latent Variable Model for Semi-Supervised Text Sequence Matching

Jun 04, 2019
Jihun Choi, Taeuk Kim, Sang-goo Lee

We present a latent variable model for predicting the relationship between a pair of text sequences. Unlike previous auto-encoding--based approaches that consider each sequence separately, our proposed framework utilizes both sequences within a single model by generating a sequence that has a given relationship with a source sequence. We further extend the cross-sentence generating framework to facilitate semi-supervised training. We also define novel semantic constraints that lead the decoder network to generate semantically plausible and diverse sequences. We demonstrate the effectiveness of the proposed model from quantitative and qualitative experiments, while achieving state-of-the-art results on semi-supervised natural language inference and paraphrase identification.

* ACL 2019 

  Access Paper or Ask Questions

Impact of Stop Sets on Stopping Active Learning for Text Classification

Jan 08, 2022
Luke Kurlandski, Michael Bloodgood

Active learning is an increasingly important branch of machine learning and a powerful technique for natural language processing. The main advantage of active learning is its potential to reduce the amount of labeled data needed to learn high-performing models. A vital aspect of an effective active learning algorithm is the determination of when to stop obtaining additional labeled data. Several leading state-of-the-art stopping methods use a stop set to help make this decision. However, there has been relatively less attention given to the choice of stop set than to the stopping algorithms that are applied on the stop set. Different choices of stop sets can lead to significant differences in stopping method performance. We investigate the impact of different stop set choices on different stopping methods. This paper shows the choice of the stop set can have a significant impact on the performance of stopping methods and the impact is different for stability-based methods from that on confidence-based methods. Furthermore, the unbiased representative stop sets suggested by original authors of methods work better than the systematically biased stop sets used in recently published work, and stopping methods based on stabilizing predictions have stronger performance than confidence-based stopping methods when unbiased representative stop sets are used. We provide the largest quantity of experimental results on the impact of stop sets to date. The findings are important for helping to illuminate the impact of this important aspect of stopping methods that has been under-considered in recently published work and that can have a large practical impact on the performance of stopping methods for important semantic computing applications such as technology assisted review and text classification more broadly.

* 8 pages, 3 tables, 1 figure; to appear in Proceedings of the IEEE 16th International Conference on Semantic Computing (ICSC 2022) 

  Access Paper or Ask Questions

edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Jan 05, 2021
Sourav Ghosh, Sourabh Vasant Gothe, Chandramouli Sanchi, Barath Raj Kandur Raja

Abugida refers to a phonogram writing system where each syllable is represented using a single consonant or typographic ligature, along with a default vowel or optional diacritic(s) to denote other vowels. However, texting in these languages has some unique challenges in spite of the advent of devices with soft keyboard supporting custom key layouts. The number of characters in these languages is large enough to require characters to be spread over multiple views in the layout. Having to switch between views many times to type a single word hinders the natural thought process. This prevents popular usage of native keyboard layouts. On the other hand, supporting romanized scripts (native words transcribed using Latin characters) with language model based suggestions is also set back by the lack of uniform romanization rules. To this end, we propose a disambiguation algorithm and showcase its usefulness in two novel mutually non-exclusive input methods for languages natively using the abugida writing system: (a) disambiguation of ambiguous input for abugida scripts, and (b) disambiguation of word variants in romanized scripts. We benchmark these approaches using public datasets, and show an improvement in typing speed by 19.49%, 25.13%, and 14.89%, in Hindi, Bengali, and Thai, respectively, using Ambiguous Input, owing to the human ease of locating keys combined with the efficiency of our inference method. Our Word Variant Disambiguation (WDA) maps valid variants of romanized words, previously treated as Out-of-Vocab, to a vocabulary of 100k words with high accuracy, leading to an increase in Error Correction F1 score by 10.03% and Next Word Prediction (NWP) by 62.50% on average.

* Accepted for publication in the 15th IEEE International Conference on Semantic Computing (IEEE ICSC 2021) 

  Access Paper or Ask Questions

Understanding, Detecting, and Separating Out-of-Distribution Samples and Adversarial Samples in Text Classification

Apr 09, 2022
Cheng-Han Chiang, Hung-yi Lee

In this paper, we study the differences and commonalities between statistically out-of-distribution (OOD) samples and adversarial (Adv) samples, both of which hurting a text classification model's performance. We conduct analyses to compare the two types of anomalies (OOD and Adv samples) with the in-distribution (ID) ones from three aspects: the input features, the hidden representations in each layer of the model, and the output probability distributions of the classifier. We find that OOD samples expose their aberration starting from the first layer, while the abnormalities of Adv samples do not emerge until the deeper layers of the model. We also illustrate that the models' output probabilities for Adv samples tend to be more unconfident. Based on our observations, we propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model. On multiple combinations of ID, OOD datasets, and Adv attacks, our proposed method shows exceptional results on distinguishing ID, OOD, and Adv samples.

* Preprint. Work in progress 

  Access Paper or Ask Questions

D2S: Document-to-Slide Generation Via Query-Based Text Summarization

May 08, 2021
Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, Nancy X. R. Wang

Presentations are critical for communication in all areas of our lives, yet the creation of slide decks is often tedious and time-consuming. There has been limited research aiming to automate the document-to-slides generation process and all face a critical challenge: no publicly available dataset for training and benchmarking. In this work, we first contribute a new dataset, SciDuet, consisting of pairs of papers and their corresponding slides decks from recent years' NLP and ML conferences (e.g., ACL). Secondly, we present D2S, a novel system that tackles the document-to-slides task with a two-step approach: 1) Use slide titles to retrieve relevant and engaging text, figures, and tables; 2) Summarize the retrieved context into bullet points with long-form question answering. Our evaluation suggests that long-form QA outperforms state-of-the-art summarization baselines on both automated ROUGE metrics and qualitative human evaluation.

* accepted at NAACL 2021 

  Access Paper or Ask Questions