
"Sentiment": models, code, and papers

uniblock: Scoring and Filtering Corpus with Unicode Block Information

Aug 26, 2019
Yingbo Gao, Weiyue Wang, Hermann Ney

Preprocessing pipelines in Natural Language Processing usually involve a step that removes sentences consisting of illegal characters. The definition of illegal characters and the specific removal strategy depend on the task, language, domain, etc., which often leads to tiresome and repetitive scripting of rules. In this paper, we introduce a simple statistical method, uniblock, to overcome this problem. For each sentence, uniblock generates a fixed-size feature vector using the Unicode block information of its characters. A Gaussian mixture model is then estimated on a clean corpus using variational inference. The learned model can then be used to score sentences and filter corpora. We present experimental results on Sentiment Analysis, Language Modeling and Machine Translation, and show the simplicity and effectiveness of our method.
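The feature step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the block ranges below are a tiny hand-picked subset of the real Unicode blocks, and the GMM scoring stage is omitted.

```python
# Illustrative subset of Unicode block ranges (the real table has ~300 blocks).
BLOCKS = [
    ("Basic Latin",        0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("CJK Unified",        0x4E00, 0x9FFF),
]

def block_features(sentence):
    """Map a sentence to relative character frequencies per Unicode block,
    with a final catch-all bucket for everything else."""
    counts = [0] * (len(BLOCKS) + 1)
    for ch in sentence:
        cp = ord(ch)
        for i, (_, lo, hi) in enumerate(BLOCKS):
            if lo <= cp <= hi:
                counts[i] += 1
                break
        else:
            counts[-1] += 1          # character fell in no listed block
    total = sum(counts) or 1
    return [c / total for c in counts]
```

On top of such vectors, the paper fits a Gaussian mixture model with variational inference on clean text and thresholds the resulting sentence scores to filter noisy corpora.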

* EMNLP 2019 


Evaluating Language Model Finetuning Techniques for Low-resource Languages

Jun 30, 2019
Jan Christian Blaise Cruz, Charibeth Cheng

Unlike mainstream languages (such as English and French), low-resource languages often suffer from a lack of expert-annotated corpora and benchmark resources, which makes it hard to apply state-of-the-art techniques directly. In this paper, we alleviate this scarcity problem for the low-resourced Filipino language in two ways. First, we introduce a new benchmark language modeling dataset in Filipino which we call WikiText-TL-39. Second, we show that language model finetuning techniques such as BERT and ULMFiT can be used to consistently train robust classifiers in low-resource settings, experiencing at most a 0.0782 increase in validation error when the number of training examples is decreased from 10K to 1K while finetuning using a privately held sentiment dataset.
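One concrete ingredient of the ULMFiT finetuning recipe mentioned above is the slanted triangular learning-rate schedule: the rate warms up linearly for a short fraction of training, then decays linearly, which helps adapt a pre-trained language model without catastrophic forgetting. A hedged sketch of that schedule (the default constants follow the ULMFiT paper, not this abstract):

```python
import math

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t of T total steps: linear warm-up for the
    first cut_frac of training, then linear decay down to lr_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

start = slanted_triangular_lr(0, 100)    # lr_max / ratio at step 0
peak = slanted_triangular_lr(10, 100)    # lr_max right after warm-up
```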

* Pretrained models and datasets available at 


Deep learning for language understanding of mental health concepts derived from Cognitive Behavioural Therapy

Sep 03, 2018
Lina Rojas-Barahona, Bo-Hsiang Tseng, Yinpei Dai, Clare Mansfield, Osman Ramadan, Stefan Ultes, Michael Crawford, Milica Gasic

In recent years, we have seen deep learning and distributed representations of words and sentences make an impact on a number of natural language processing tasks, such as similarity, entailment and sentiment analysis. Here we introduce a new task: understanding of mental health concepts derived from Cognitive Behavioural Therapy (CBT). We define a mental health ontology based on CBT principles, annotate a large corpus where these phenomena are exhibited, and perform understanding using deep learning and distributed representations. Our results show that deep learning models combined with word or sentence embeddings significantly outperform non-deep-learning models on this difficult task. This understanding module will be an essential component of a statistical dialogue system delivering therapy.

* Accepted for publication at LOUHI 2018: The Ninth International Workshop on Health Text Mining and Information Analysis 


Text Classification based on Multiple Block Convolutional Highways

Jul 23, 2018
Seyed Mahdi Rezaeinia, Ali Ghodsi, Rouhollah Rahmani

In the Text Classification areas of Sentiment Analysis, Subjectivity/Objectivity Analysis, and Opinion Polarity, Convolutional Neural Networks have gained special attention because of their performance and accuracy. In this work, we apply recent advances in CNNs and propose a novel architecture, Multiple Block Convolutional Highways (MBCH), which achieves improved accuracy on multiple popular benchmark datasets compared to previous architectures. MBCH is based on techniques and architectures including highway networks, DenseNet, batch normalization and bottleneck layers. In addition, to cope with the limitations of existing pre-trained word vectors which are used as inputs for the CNN, we propose a novel method, Improved Word Vectors (IWV). The IWV improves the accuracy of CNNs used for text classification tasks.
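The highway-network building block mentioned in the abstract can be sketched in one line of math: a transform gate T mixes a nonlinear transform H(x) with the untouched input x, giving y = T·H(x) + (1 − T)·x. A toy per-dimension version (the MBCH layers use full weight matrices; this is only a hedged illustration):

```python
import math

def highway_unit(x, w_h, b_h, w_t, b_t):
    """One elementwise highway unit: y = t*h + (1-t)*x."""
    h = math.tanh(w_h * x + b_h)                   # candidate transform H(x)
    t = 1.0 / (1.0 + math.exp(-(w_t * x + b_t)))   # transform gate T(x) in (0, 1)
    return t * h + (1.0 - t) * x                   # gated mix of transform and carry

# With a strongly negative gate bias, the unit is close to an identity map,
# which is what lets highway networks train very deep stacks.
y_carry = highway_unit(0.5, 1.0, 0.0, 1.0, -10.0)
```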

* arXiv admin note: text overlap with arXiv:1711.08609 


Improving Aspect Term Extraction with Bidirectional Dependency Tree Representation

May 21, 2018
Huaishao Luo, Tianrui Li, Bing Liu, Bin Wang, Herwig Unger

Aspect term extraction is one of the important subtasks in aspect-based sentiment analysis. Previous studies have shown that dependency tree structure representation is promising for this task. In this paper, we propose a novel bidirectional dependency tree network to extract dependency structure features from the given sentences. The key idea is to explicitly incorporate both representations gained separately from the bottom-up and top-down propagation on the given dependency syntactic tree. An end-to-end framework is proposed to integrate the embedded representations and BiLSTM plus CRF to learn both tree-structured and sequential features to solve the aspect term extraction problem. Experimental results demonstrate that the proposed model outperforms state-of-the-art baseline models on four benchmark SemEval datasets.
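The bidirectional idea above can be sketched on a toy dependency tree: each token gets one representation aggregated bottom-up from its subtree and one propagated top-down from the root, and the two are concatenated. The real model uses learned recurrent cells on both passes; this hedged sketch just sums scalar features.

```python
def bidirectional_tree_features(parents, feats):
    """parents[i] is the head of token i (-1 for the root); feats[i] is a
    scalar feature. Returns (bottom-up, top-down) pairs per token."""
    n = len(feats)
    children = [[] for _ in range(n)]
    root = 0
    for i, p in enumerate(parents):
        if p == -1:
            root = i
        else:
            children[p].append(i)

    up = [0.0] * n        # bottom-up: sum over each node's subtree
    def collect(i):
        up[i] = feats[i] + sum(collect(c) for c in children[i])
        return up[i]
    collect(root)

    down = [0.0] * n      # top-down: sum along the path from the root
    def spread(i, acc):
        down[i] = acc + feats[i]
        for c in children[i]:
            spread(c, down[i])
    spread(root, 0.0)

    return [(up[i], down[i]) for i in range(n)]

# Tiny tree: token 1 is the root, tokens 0 and 2 depend on it.
reps = bidirectional_tree_features([1, -1, 1], [1.0, 2.0, 3.0])
```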


DepecheMood: a Lexicon for Emotion Analysis from Crowd-Annotated News

May 07, 2014
Jacopo Staiano, Marco Guerini

While many lexica annotated with word polarity are available for sentiment analysis, very few tackle the harder task of emotion analysis, and they are usually quite limited in coverage. In this paper, we present a novel approach for extracting, in a totally automated way, a high-coverage and high-precision lexicon of roughly 37 thousand terms annotated with emotion scores, called DepecheMood. Our approach exploits in an original way the 'crowd-sourced' affective annotation implicitly provided by readers of news articles. By providing new state-of-the-art performance in unsupervised settings for regression and classification tasks, even using a naïve approach, our experiments show the beneficial impact of harvesting social media data for affective lexicon building.
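The harvesting idea can be sketched as a matrix product between word-document frequencies and normalized document-level emotion votes. This is a hedged simplification of the pipeline, not the authors' exact normalization:

```python
def build_lexicon(doc_emotions, doc_words):
    """doc_emotions: {doc: {emotion: reader votes}};
    doc_words: {doc: {word: count}} -> {word: {emotion: score}}."""
    lexicon = {}
    for doc, words in doc_words.items():
        votes = doc_emotions[doc]
        total_votes = sum(votes.values()) or 1
        total_words = sum(words.values()) or 1
        for word, count in words.items():
            scores = lexicon.setdefault(word, {})
            weight = count / total_words          # word's share of the document
            for emotion, v in votes.items():
                # accumulate the document's vote distribution, word-weighted
                scores[emotion] = scores.get(emotion, 0.0) + weight * v / total_votes
    return lexicon

# Invented toy data: two documents with reader emotion votes.
lex = build_lexicon(
    {"d1": {"sad": 3, "amused": 1}, "d2": {"sad": 0, "amused": 4}},
    {"d1": {"funeral": 1}, "d2": {"party": 1}},
)
```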

* To appear at ACL 2014. 7 pages 


Using Nuances of Emotion to Identify Personality

Sep 24, 2013
Saif M. Mohammad, Svetlana Kiritchenko

Past work on personality detection has shown that the frequency of lexical categories such as first person pronouns, past tense verbs, and sentiment words has significant correlations with personality traits. In this paper, for the first time, we show that fine affect (emotion) categories such as excitement, guilt, yearning, and admiration are significant indicators of personality. Additionally, we perform experiments to show that the gains provided by the fine affect categories are not obtained by using coarse affect categories alone or with specificity features alone. We employ these features in five SVM classifiers for detecting five personality traits from essays. We find that the use of fine emotion features leads to statistically significant improvement over a competitive baseline, whereas the use of coarse affect and specificity features does not.
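The fine-affect features above amount to lexicon-category frequencies over an essay. A hedged sketch (the tiny lexicon below is invented for illustration; the paper uses large affect resources):

```python
# Hypothetical mini-lexicon: fine emotion category -> word set.
LEXICON = {
    "excitement": {"thrilled", "eager"},
    "guilt": {"sorry", "regret"},
    "admiration": {"admire", "impressive"},
}

def affect_features(text):
    """For each fine category, the fraction of tokens listed under it.
    These per-category frequencies would then feed an SVM classifier."""
    tokens = text.lower().split()
    n = len(tokens) or 1
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in LEXICON.items()}

feats = affect_features("I regret nothing and admire everyone")
```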

* In Proceedings of the ICWSM Workshop on Computational Personality Recognition, July 2013, Boston, USA 


Prompt-based Pre-trained Model for Personality and Interpersonal Reactivity Prediction

Mar 23, 2022
Bin Li, Yixuan Weng, Qiya Song, Fuyan Ma, Bin Sun, Shutao Li

This paper describes the LingJing team's method for the Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA) 2022 shared task on Personality Prediction (PER) and Reactivity Index Prediction (IRI). In this paper, we adopt a prompt-based method with a pre-trained language model to accomplish these tasks. Specifically, the prompt is designed to provide extra knowledge for enhancing the pre-trained model. Data augmentation and model ensembling are adopted to obtain better results. Extensive experiments are performed, which show the effectiveness of the proposed method. On the final submission, our system achieves a Pearson Correlation Coefficient of 0.2301 and 0.2546 on Track 3 and Track 4, respectively. We ranked first on both sub-tasks.
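The core of a prompt-based method is wrapping the raw input in a natural-language template so the pre-trained model can fill in the target slot. The exact template is not given in the abstract; the one below is a purely hypothetical example of the pattern:

```python
def build_prompt(essay, trait):
    """Wrap an essay in a cloze-style template for a masked language model.
    The wording and the [MASK] convention here are illustrative only."""
    return (f"Essay: {essay} "
            f"Question: how strongly does the author show {trait}? "
            f"Answer: [MASK].")

prompt = build_prompt("I love meeting new people.", "extraversion")
```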

* Shared task paper describing the team's contribution to the Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA) @ ACL 2022 


Privacy enabled Financial Text Classification using Differential Privacy and Federated Learning

Oct 04, 2021
Priyam Basu, Tiasa Singha Roy, Rakshit Naidu, Zumrut Muftuoglu

Privacy is important in the financial domain, as such data is highly confidential and sensitive. Natural Language Processing (NLP) techniques can be applied for text classification and entity detection in financial applications such as customer-feedback sentiment analysis, invoice entity detection, and categorisation of financial documents by type. Due to the sensitive nature of such data, privacy measures need to be taken when handling it and training large models on it. In this work, we propose a contextualized transformer (BERT and RoBERTa) based text classification model integrated with privacy features such as Differential Privacy (DP) and Federated Learning (FL). We show how to privately train NLP models with desirable privacy-utility tradeoffs and evaluate them on the Financial Phrase Bank dataset.
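One of the two privacy features mentioned, federated learning, centers on federated averaging (FedAvg): each client trains locally and shares only its weights, which the server averages weighted by client data size. A hedged sketch of that aggregation step (DP would additionally clip and noise the updates; that step is omitted here):

```python
def fedavg(client_weights, client_sizes):
    """Average client weight vectors, weighted by each client's data size.
    Only the weights cross the network; raw financial text never does."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Two toy clients: the second has 3x as much data, so it dominates the average.
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```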

* 4 pages. Accepted at ECONLP-EMNLP'21 


TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification

Oct 26, 2020
Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, Luis Espinosa-Anke

The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. As a result, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks. We also provide a strong set of baselines as a starting point, and compare different language modeling pre-training strategies. Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models and continuing to train them on Twitter corpora.
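The unified-evaluation idea boils down to scoring every task with a common metric and averaging across tasks so heterogeneous tasks become comparable. A hedged sketch using macro-averaged recall as the per-task metric (TweetEval's actual per-task metrics vary; this only illustrates the aggregation):

```python
def macro_recall(gold, pred):
    """Recall per gold label, averaged over labels (macro-averaging)."""
    labels = set(gold)
    recalls = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        n = sum(1 for g in gold if g == lab)
        recalls.append(tp / n)
    return sum(recalls) / len(recalls)

def benchmark_score(tasks):
    """tasks: {name: (gold_labels, predicted_labels)} -> mean task score."""
    return sum(macro_recall(g, p) for g, p in tasks.values()) / len(tasks)

# Two toy tasks with invented predictions.
score = benchmark_score({
    "sentiment": (["pos", "neg", "pos"], ["pos", "neg", "neg"]),
    "irony":     (["ir", "not"], ["ir", "not"]),
})
```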

* Findings of EMNLP 2020. TweetEval benchmark available at 
