Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression

Apr 16, 2019
Sandip Modha, Prasenjit Majumder

This paper attempt to study the effectiveness of text representation schemes on two tasks namely: User Aggression and Fact Detection from the social media contents. In User Aggression detection, The aim is to identify the level of aggression from the contents generated in the Social media and written in the English, Devanagari Hindi and Romanized Hindi. Aggression levels are categorized into three predefined classes namely: `Non-aggressive`, `Overtly Aggressive`, and `Covertly Aggressive`. During the disaster-related incident, Social media like, Twitter is flooded with millions of posts. In such emergency situations, identification of factual posts is important for organizations involved in the relief operation. We anticipated this problem as a combination of classification and Ranking problem. This paper presents a comparison of various text representation scheme based on BoW techniques, distributed word/sentence representation, transfer learning on classifiers. Weighted $F_1$ score is used as a primary evaluation metric. Results show that text representation using BoW performs better than word embedding on machine learning classifiers. While pre-trained Word embedding techniques perform better on classifiers based on deep neural net. Recent transfer learning model like ELMO, ULMFiT are fine-tuned for the Aggression classification task. However, results are not at par with pre-trained word embedding model. Overall, word embedding using fastText produce best weighted $F_1$-score than Word2Vec and Glove. Results are further improved using pre-trained vector model. Statistical significance tests are employed to ensure the significance of the classification results. In the case of lexically different test Dataset, other than training Dataset, deep neural models are more robust and perform substantially better than machine learning classifiers.

* 21 Page, 2 Figure 

  Access Paper or Ask Questions

On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets

Jun 08, 2016
Gregory Grefenstette, Lawrence Muchemi

Current research in lifelog data has not paid enough attention to analysis of cognitive activities in comparison to physical activities. We argue that as we look into the future, wearable devices are going to be cheaper and more prevalent and textual data will play a more significant role. Data captured by lifelogging devices will increasingly include speech and text, potentially useful in analysis of intellectual activities. Analyzing what a person hears, reads, and sees, we should be able to measure the extent of cognitive activity devoted to a certain topic or subject by a learner. Test-based lifelog records can benefit from semantic analysis tools developed for natural language processing. We show how semantic analysis of such text data can be achieved through the use of taxonomic subject facets and how these facets might be useful in quantifying cognitive activity devoted to various topics in a person's day. We are currently developing a method to automatically create taxonomic topic vocabularies that can be applied to this detection of intellectual activity.

* iConference 2016 SIE on Lifelogging, Mar 2016, Philadelphia, United States. iConference 2016 SIE on Lifelogging, 2016 

  Access Paper or Ask Questions

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Oct 14, 2021
Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-training natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the speech/text input through the pre-nets, the shared encoder-decoder network models the sequence to sequence transformation, and then the post-nets generate the output in the speech/text modality based on the decoder output. Particularly, SpeechT5 can pre-train on a large scale of unlabeled speech and text data to improve the capability of the speech and textual modeling. To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text. Extensive evaluations on a wide variety of spoken language processing tasks, including voice conversion, automatic speech recognition, text to speech, and speaker identification, show the superiority of the proposed SpeechT5 framework.

* work in process 

  Access Paper or Ask Questions

Text to brain: predicting the spatial distribution of neuroimaging observations from text reports

Jun 28, 2018
Jérôme Dockès, Demian Wassermann, Russell Poldrack, Fabian Suchanek, Bertrand Thirion, Gaël Varoquaux

Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of information untapped in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is formulated in terms of minimization of a risk on distributions which leads to a least-deviation cost function. An efficient algorithm in the dual then learns the mapping from documents to brain structures. Empirical results using coordinates extracted from the brain-imaging literature show that i) models must adapt to semantic variation in the terms used to describe a given anatomical structure, ii) voxel-wise parameterization leads to higher likelihood of locations reported in unseen documents, iii) least-deviation cost outperforms least-square. As a proof of concept for our method, we use our model of spatial distributions to predict the distribution of specific neurological conditions from text-only reports.

* MICCAI 2018 - 21st International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2018, Granada, Spain. pp.1-18, 2018 

  Access Paper or Ask Questions

What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey

Sep 14, 2021
Md Rayhanur Rahman, Rezvan Mahdavi-Hezaveh, Laurie Williams

Cybersecurity researchers have contributed to the automated extraction of CTI from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to aid cybersecurity researchers understand the current techniques used for cyberthreat intelligence extraction from text through a survey of relevant studies in the literature. We systematically collect "CTI extraction from text"-related studies from the literature and categorize the CTI extraction purposes. We propose a CTI extraction pipeline abstracted from these studies. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as extraction indicators of compromise extraction, TTPs (tactics, techniques, procedures of attack), and cybersecurity keywords. We also identify seven types of textual sources for CTI extraction, and textual data obtained from hacker forums, threat reports, social media posts, and online news articles have been used by almost 90% of the studies. Natural language processing along with both supervised and unsupervised machine learning techniques such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering are used for CTI extraction. We observe the technical challenges associated with these studies related to obtaining available clean, labelled data which could assure replication, validation, and further extension of the studies. As we find the studies focusing on CTI information extraction from text, we advocate for building upon the current CTI extraction work to help cybersecurity practitioners with proactive decision making such as threat prioritization, automated threat modelling to utilize knowledge from past cybersecurity incidents.

  Access Paper or Ask Questions

Block-wise Lensless Compressive Camera

Jan 19, 2017
Xin Yuan, Gang Huang, Hong Jiang, Paul Wilford

The existing lensless compressive camera ($\text{L}^2\text{C}^2$)~\cite{Huang13ICIP} suffers from low capture rates, resulting in low resolution images when acquired over a short time. In this work, we propose a new regime to mitigate these drawbacks. We replace the global-based compressive sensing used in the existing $\text{L}^2\text{C}^2$ by the local block (patch) based compressive sensing. We use a single sensor for each block, rather than for the entire image, thus forming a multiple but spatially parallel sensor $\text{L}^2\text{C}^2$. This new camera retains the advantages of existing $\text{L}^2\text{C}^2$ while leading to the following additional benefits: 1) Since each block can be very small, {\em e.g.}$~8\times 8$ pixels, we only need to capture $\sim 10$ measurements to achieve reasonable reconstruction. Therefore the capture time can be reduced significantly. 2) The coding patterns used in each block can be the same, therefore the sensing matrix is only of the block size compared to the entire image size in existing $\text{L}^2\text{C}^2$. This saves the memory requirement of the sensing matrix as well as speeds up the reconstruction. 3) Patch based image reconstruction is fast and since real time stitching algorithms exist, we can perform real time reconstruction. 4) These small blocks can be integrated to any desirable number, leading to ultra high resolution images while retaining fast capture rate and fast reconstruction. We develop multiple geometries of this block-wise $\text{L}^2\text{C}^2$ in this paper. We have built prototypes of the proposed block-wise $\text{L}^2\text{C}^2$ and demonstrated excellent results of real data.

* 5 pages, 10 figures 

  Access Paper or Ask Questions

Entropy optimized semi-supervised decomposed vector-quantized variational autoencoder model based on transfer learning for multiclass text classification and generation

Nov 10, 2021
Shivani Malhotra, Vinay Kumar, Alpana Agarwal

Semisupervised text classification has become a major focus of research over the past few years. Hitherto, most of the research has been based on supervised learning, but its main drawback is the unavailability of labeled data samples in practical applications. It is still a key challenge to train the deep generative models and learn comprehensive representations without supervision. Even though continuous latent variables are employed primarily in deep latent variable models, discrete latent variables, with their enhanced understandability and better compressed representations, are effectively used by researchers. In this paper, we propose a semisupervised discrete latent variable model for multi-class text classification and text generation. The proposed model employs the concept of transfer learning for training a quantized transformer model, which is able to learn competently using fewer labeled instances. The model applies decomposed vector quantization technique to overcome problems like posterior collapse and index collapse. Shannon entropy is used for the decomposed sub-encoders, on which a variable DropConnect is applied, to retain maximum information. Moreover, gradients of the Loss function are adaptively modified during backpropagation from decoder to encoder to enhance the performance of the model. Three conventional datasets of diversified range have been used for validating the proposed model on a variable number of labeled instances. Experimental results indicate that the proposed model has surpassed the state-of-the-art models remarkably.

* 12 pages, 4 figures 

  Access Paper or Ask Questions

A Multi-cascaded Deep Model for Bilingual SMS Classification

Nov 29, 2019
Muhammad Haroon Shakeel, Asim Karim, Imdadullah Khan

Most studies on text classification are focused on the English language. However, short texts such as SMS are influenced by regional languages. This makes the automatic text classification task challenging due to the multilingual, informal, and noisy nature of language in the text. In this work, we propose a novel multi-cascaded deep learning model called McM for bilingual SMS classification. McM exploits $n$-gram level information as well as long-term dependencies of text for learning. Our approach aims to learn a model without any code-switching indication, lexical normalization, language translation, or language transliteration. The model relies entirely upon the text as no external knowledge base is utilized for learning. For this purpose, a 12 class bilingual text dataset is developed from SMS feedbacks of citizens on public services containing mixed Roman Urdu and English languages. Our model achieves high accuracy for classification on this dataset and outperforms the previous model for multilingual text classification, highlighting language independence of McM.

  Access Paper or Ask Questions

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Apr 15, 2021
Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using data less than 3% of the size of the data used by the state of the art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with the previous work done on English. All annotated datasets used for experimentation will be released at

* Accepted to NAACL 2021 

  Access Paper or Ask Questions

Teach me how to Label: Labeling Functions from Natural Language with Text-to-text Transformers

Jan 18, 2021
Yannis Papanikolaou

Annotated data has become the most important bottleneck in training accurate machine learning models, especially for areas that require domain expertise. A recent approach to deal with the above issue proposes using natural language explanations instead of labeling individual data points, thereby increasing human annotators' efficiency as well as decreasing costs substantially. This paper focuses on the task of turning these natural language descriptions into Python labeling functions by following a novel approach to semantic parsing with pre-trained text-to-text Transformers. In a series of experiments our approach achieves a new state of the art on the semantic parsing benchmark CoNaLa, surpassing the previous best approach by 3.7 BLEU points. Furthermore, on a manually constructed dataset of natural language descriptions-labeling functions pairs we achieve a BLEU of 0.39. Our approach can be regarded as a stepping stone towards models that are taught how to label in natural language, instead of being provided specific labeled samples. Our code, constructed dataset and models are available at

  Access Paper or Ask Questions