"Text": models, code, and papers

Arabic Text Categorization Algorithm using Vector Evaluation Method

Jan 06, 2015
Ashraf Odeh, Aymen Abu-Errub, Qusai Shambour, Nidal Turab

Text categorization is the process of grouping documents into categories based on their contents. This process makes information retrieval easier, and it has become more important given the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a promising new field, there has been little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a corpus of categorized Arabic documents; the weights of the tested document's words are calculated to determine the document's keywords, which are then compared with the keywords of the corpus categories to determine the tested document's best category.

* International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 6, December 2014 
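
The keyword-matching idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so this sketch assumes raw term frequency as the word weight and keyword-set overlap as the comparison; the function names, the `k` parameter, and the English placeholder tokens are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def top_keywords(tokens, k=3):
    """Return the k highest-weighted words of a document.

    Assumption: the weight of a word is its raw frequency in the document.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:k]}


def best_category(tokens, category_keywords, k=3):
    """Pick the category whose keyword set overlaps most with the document's keywords."""
    keywords = top_keywords(tokens, k)
    return max(category_keywords, key=lambda c: len(keywords & category_keywords[c]))


# Hypothetical toy data (placeholder English tokens instead of Arabic words).
categories = {
    "sports": {"goal", "team", "match"},
    "economy": {"bank", "market"},
}
document = ["goal", "match", "goal", "team", "bank"]
print(best_category(document, categories))
```

A real system would likely add Arabic-specific preprocessing (for example stop-word removal or stemming) before weighting; that step is omitted here.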


UCD-CS at W-NUT 2020 Shared Task-3: A Text to Text Approach for COVID-19 Event Extraction on Social Media

Oct 12, 2020
Congcong Wang, David Lillis

In this paper, we describe our approach to the shared task: COVID-19 event extraction from Twitter. The objective of this task is to extract answers from COVID-related tweets to a set of predefined slot-filling questions. Our approach treats the event extraction task as a question answering task by leveraging the transformer-based T5 text-to-text model. According to the official evaluation scores returned, namely F1, our submitted run achieves competitive performance compared to other participating runs (Top 3). However, we argue that this evaluation may underestimate the actual performance of runs based on text-generation. Although some such runs may answer the slot questions well, they may not be an exact string match for the gold standard answers. To measure the extent of this underestimation, we adopt a simple exact-answer transformation method aiming at converting the well-answered predictions to exactly-matched predictions. The results show that after this transformation our run overall reaches the same level of performance as the best participating run and state-of-the-art F1 scores in three of five COVID-related events. Our code is publicly available to aid reproducibility.

* 8 pages, 2 figures 


Automated Classification of Text Sentiment

Apr 05, 2018
Emmanuel Dufourq, Bruce A. Bassett

The ability to identify sentiment in text, referred to as sentiment analysis, is one which is natural to adult humans. This task is, however, not one which a computer can perform by default. Identifying sentiments in an automated, algorithmic manner will be a useful capability for business and research in their search to understand what consumers think about their products or services and to understand human sociology. Here we propose two new Genetic Algorithms (GAs) for the task of automated text sentiment analysis. The GAs learn whether words occurring in a text corpus are either sentiment or amplifier words, and their corresponding magnitude. Sentiment words, such as 'horrible', add linearly to the final sentiment. Amplifier words in contrast, which are typically adjectives/adverbs like 'very', multiply the sentiment of the following word. This increases, decreases or negates the sentiment of the following word. The sentiment of the full text is then the sum of these terms. This approach grows both a sentiment and amplifier dictionary which can be reused for other purposes and fed into other machine learning algorithms. We report the results of multiple experiments conducted on large Amazon data sets. The results reveal that our proposed approach was able to outperform several public and/or commercial sentiment analysis algorithms.

* In "2017 Annual Conference of the South African Institute of Computer Scientists and Information" 
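
The scoring rule described in the abstract above (sentiment words add linearly, amplifier words multiply the sentiment of the following word) can be sketched as follows. This is only an illustration of the scoring step, not of the genetic algorithms that learn the dictionaries. The dictionary values and the function name are hypothetical, and two details are assumptions: consecutive amplifiers are multiplied together, and an amplifier followed by a word outside the sentiment dictionary has no effect.

```python
def score_text(text, sentiment, amplifier):
    """Score a text as the sum of (amplified) sentiment word values."""
    total = 0.0
    multiplier = 1.0  # pending amplifier applied to the next word
    for word in text.lower().split():
        if word in amplifier:
            # Assumption: stacked amplifiers compose multiplicatively.
            multiplier *= amplifier[word]
        elif word in sentiment:
            total += multiplier * sentiment[word]
            multiplier = 1.0
        else:
            # A neutral word carries no sentiment, so the amplifier is dropped.
            multiplier = 1.0
    return total


# Hypothetical dictionaries; in the paper these values are learned by the GAs.
sentiment_words = {"horrible": -2.0, "good": 1.0}
amplifier_words = {"very": 2.0, "not": -1.0}

print(score_text("very horrible", sentiment_words, amplifier_words))
print(score_text("not good", sentiment_words, amplifier_words))
```

An amplifier with a negative value negates the following word, a value between 0 and 1 decreases it, and a value above 1 increases it, which matches the three effects the abstract lists.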


TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays

Jan 12, 2018
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Ronald M. Summers

Chest X-rays are one of the most common radiological examinations in daily clinical routines. Reporting thorax diseases using chest X-rays is often an entry-level task for radiologist trainees. Yet, reading a chest X-ray image remains a challenging job for learning-oriented machine intelligence, due to (1) shortage of large-scale machine-learnable medical image datasets, and (2) lack of techniques that can mimic the high-level reasoning of human radiologists that requires years of knowledge accumulation and professional training. In this paper, we show that clinical free-text radiological reports can be utilized as a priori knowledge for tackling these two key problems. We propose a novel Text-Image Embedding network (TieNet) for extracting the distinctive image and text representations. Multi-level attention models are integrated into an end-to-end trainable CNN-RNN architecture for highlighting the meaningful text words and image regions. We first apply TieNet to classify the chest X-rays by using both image features and text embeddings extracted from associated reports. The proposed auto-annotation framework achieves high accuracy (over 0.9 on average in AUCs) in assigning disease labels for our hand-labeled evaluation dataset. Furthermore, we transform TieNet into a chest X-ray reporting system. It simulates the reporting process and can output disease classification and a preliminary report together. The classification results are significantly improved (6% increase on average in AUCs) compared to the state-of-the-art baseline on an unseen and hand-labeled dataset (OpenI).

* v1: Main paper + supplementary material 


Distribution augmentation for low-resource expressive text-to-speech

Feb 19, 2022
Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows generating new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.

* ICASSP 2022: camera-ready 


StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis

Nov 04, 2021
Peter Schaldenbrand, Zhixuan Liu, Jean Oh

Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image to be generated. We introduce StyleCLIPDraw which adds a style loss to the CLIPDraw text-to-drawing synthesis model to allow artistic control of the synthesized drawings in addition to control of the content via text. Whereas performing decoupled style transfer on a generated image only affects the texture, our proposed coupled approach is able to capture a style in both texture and shape, suggesting that the style of the drawing is coupled with the drawing process itself. More results and our code are available at


BERT-Beta: A Proactive Probabilistic Approach to Text Moderation

Sep 18, 2021
Fei Tan, Yifan Hu, Kevin Yen, Changwei Hu

Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept, text toxicity propensity, to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.

* 9 pages, EMNLP'21 
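
To make the Beta regression idea in the abstract above concrete, here is a minimal sketch of the standard mean-precision formulation: a propensity y in (0, 1) is modeled as Beta(mu * phi, (1 - mu) * phi), with the mean mu obtained from a linear score through a logistic link. This is the textbook form and is assumed here; the abstract does not state the exact parameterization, features, or scaling mechanism the authors use, and all names below are hypothetical.

```python
import math


def propensity_mean(features, weights, bias):
    """Map a linear score to a mean in (0, 1) with a logistic link (assumed)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def beta_log_likelihood(y, mu, phi):
    """Log-density of y under Beta(mu * phi, (1 - mu) * phi), for 0 < y < 1."""
    a = mu * phi
    b = (1.0 - mu) * phi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)


# Hypothetical example: two text features, an observed toxicity propensity of 0.3.
mu = propensity_mean([1.0, 0.5], [0.4, -0.8], 0.0)
print(beta_log_likelihood(0.3, mu, phi=4.0))
```

Fitting the model would mean choosing the weights, bias, and precision phi that maximize the sum of these log-likelihoods over the training texts.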


Data Augmentation for Text Generation Without Any Augmented Data

May 28, 2021
Wei Bi, Huayang Li, Jiacheng Huang

Data augmentation is an effective way to improve the performance of many neural text generation models. However, current data augmentation methods need to define or choose proper data mapping functions that map the original samples into the augmented samples. In this work, we derive an objective to formulate the problem of data augmentation on text generation tasks without any use of augmented data constructed by specific mapping functions. Our proposed objective can be efficiently optimized and applied to popular loss functions on text generation tasks with a convergence rate guarantee. Experiments on five datasets of two text generation tasks show that our approach can approximate or even surpass popular data augmentation methods.

* Accepted into the main conference of ACL 2021 
