"Text": models, code, and papers

Sentiment Expression via Emoticons on Social Media

Nov 09, 2015
Hao Wang, Jorge A. Castanon

Emoticons (e.g., :) and :( ) have been widely used in sentiment analysis and other NLP tasks as features for machine learning algorithms or as entries of sentiment lexicons. In this paper, we argue that while emoticons are strong and common signals of sentiment expression on social media, the relationship between emoticons and sentiment polarity is not always clear. Thus, any algorithm that deals with sentiment polarity should take emoticons into account, but extreme caution should be exercised in choosing which emoticons to depend on. First, to demonstrate the prevalence of emoticons on social media, we analyzed the frequency of emoticons in a large recent Twitter data set. Then we carried out four analyses to examine the relationship between emoticons and sentiment polarity as well as the contexts in which emoticons are used. The first analysis surveyed a group of participants for their perceived sentiment polarity of the most frequent emoticons. The second analysis examined clustering of words and emoticons to better understand the meaning conveyed by the emoticons. The third analysis compared the sentiment polarity of microblog posts before and after emoticons were removed from the text. The last analysis tested the hypothesis that removing emoticons from text hurts sentiment classification by training two machine learning models with and without emoticons in the text respectively. The results confirm the arguments that: 1) a few emoticons are strong and reliable signals of sentiment polarity and one should take advantage of them in any sentiment analysis; 2) a large group of the emoticons conveys complicated sentiment, hence they should be treated with extreme caution.


Measuring Uncertainty in Translation Quality Evaluation (TQE)

Nov 15, 2021
Serge Gladkoff, Irina Sorokina, Lifeng Han, Alexandra Alekseeva

From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations that meet customer specifications under harsh constraints on the required quality level, within tight time-frames and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}. Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity, or is statistics at play? How can checking the entire text be avoided so that TQE becomes more efficient in cost and effort, and what is the optimal sample size of the translated text for reliably estimating the translation quality of the entire material? This work carries out such motivated research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of the translated text, e.g. the number of words or sentences, that needs to be processed at the TQE workflow step for confident and reliable evaluation of overall translation quality. The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).

* 13 pages, 9 figures 
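
The abstract above ties the width of a confidence interval to the size of the checked sample. A minimal sketch of that relationship follows, assuming each checked unit (word or sentence) is an independent Bernoulli trial that is either erroneous or not. The Wilson score interval is one standard choice for a Bernoulli proportion and is swapped in here for illustration; it is not necessarily the interval the authors use. The function names and the simulated error rate are hypothetical.

```python
import math
import random


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a Bernoulli proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = errors / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def monte_carlo_coverage(true_rate, n, trials=2000, seed=0):
    """Simulate `trials` samples of size n and report how often the
    interval contains the (assumed) true error rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw n Bernoulli trials: each unit is erroneous with prob. true_rate.
        errors = sum(rng.random() < true_rate for _ in range(n))
        lo, hi = wilson_interval(errors, n)
        hits += lo <= true_rate <= hi
    return hits / trials


if __name__ == "__main__":
    # Same observed error rate (5%), growing sample size: the interval narrows.
    for n in (100, 1000, 10000):
        lo, hi = wilson_interval(round(0.05 * n), n)
        print(n, round(lo, 4), round(hi, 4))
    print("coverage:", monte_carlo_coverage(0.05, 200))
```

Under these assumptions, the interval width at a fixed observed error rate shrinks roughly with the square root of the sample size, which is the trade-off an evaluator would weigh when deciding how much text to check.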


Hetero-SCAN: Towards Social Context Aware Fake News Detection via Heterogeneous Graph Neural Network

Sep 13, 2021
Jian Cui, Kwanwoo Kim, Seung Ho Na, Seungwon Shin

Fake news, false or misleading information presented as news, has a great impact on many aspects of society, such as politics and healthcare. To handle this emerging problem, many fake news detection methods have been proposed, applying Natural Language Processing (NLP) techniques to the article text. Considering that even people cannot easily distinguish fake news by news content, these text-based solutions are insufficient. To further improve fake news detection, researchers suggested graph-based solutions, utilizing social context information such as user engagement or publisher information. However, existing graph-based methods still suffer from the following four major drawbacks: 1) expensive computational cost due to a large number of user nodes in the graph, 2) errors in sub-tasks, such as textual encoding or stance detection, 3) loss of rich social context due to homogeneous representation of news graphs, and 4) the absence of temporal information utilization. In order to overcome the aforementioned issues, we propose a novel social context aware fake news detection method, Hetero-SCAN, based on a heterogeneous graph neural network. Hetero-SCAN learns the news representation from the heterogeneous graph of news in an end-to-end manner. We demonstrate that Hetero-SCAN yields significant improvement over state-of-the-art text-based and graph-based fake news detection methods in terms of performance and efficiency.


Content-driven, unsupervised clustering of news articles through multiscale graph partitioning

Aug 03, 2018
M. Tarik Altuncu, Sophia N. Yaliraki, Mauricio Barahona

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven groupings of articles. We present here such a methodology that brings together powerful vector embeddings from Natural Language Processing with tools from Graph Theory that exploit diffusive dynamics on graphs to reveal natural partitions across scales. Our framework uses a recent deep neural network text analysis methodology (Doc2vec) to represent text in vector form and then applies a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allows us to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. We showcase our approach with the analysis of a corpus of 9,000 news articles published by Vox Media over one year. Our results show consistent groupings of documents according to content without a priori assumptions about the number or type of clusters to be found. The multilevel clustering reveals a quasi-hierarchy of topics and subtopics with increased intelligibility and improved topic coherence as compared to external taxonomy services and standard topic detection methods.

* 8 pages; 5 figures; To present at KDD 2018: Data Science, Journalism & Media workshop 


The ontogeny of discourse structure mimics the development of literature

Dec 27, 2016
Natalia Bezerra Mota, Sylvia Pinheiro, Mariano Sigman, Diego Fernandez Slezak, Guillermo Cecchi, Mauro Copelli, Sidarta Ribeiro

Discourse varies with age, education, psychiatric state and historical epoch, but the ontogenetic and cultural dynamics of discourse structure remain to be quantitatively characterized. To this end we investigated word graphs obtained from verbal reports of 200 subjects ages 2-58, and 676 literary texts spanning ~5,000 years. In healthy subjects, lexical diversity, graph size, and long-range recurrence departed from initial near-random levels through a monotonic asymptotic increase across ages, while short-range recurrence showed a corresponding decrease. These changes were explained by education and suggest a hierarchical development of discourse structure: short-range recurrence and lexical diversity stabilize after elementary school, but graph size and long-range recurrence only stabilize after high school. This gradual maturation was blurred in psychotic subjects, who maintained in adulthood a near-random structure. In literature, monotonic asymptotic changes over time were remarkable: While lexical diversity, long-range recurrence and graph size increased away from near-randomness, short-range recurrence declined, from above to below random levels. Bronze Age texts are structurally similar to childish or psychotic discourses, but subsequent texts converge abruptly to the healthy adult pattern around the onset of the Axial Age (800-200 BC), a period of pivotal cultural change. Thus, individually as well as historically, discourse maturation increases the range of word recurrence away from randomness.

* Natalia Bezerra Mota and Sylvia Pinheiro: Equal contribution Sidarta Ribeiro and Mauro Copelli: Corresponding authors 


A Readable Read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity

Mar 29, 2016
Ildikó Pilán, Sowmya Vajjala, Elena Volodina

Corpora and web texts can become a rich language learning resource if we have a means of assessing whether they are linguistically appropriate for learners at a given proficiency level. In this paper, we aim at addressing this issue by presenting the first approach for predicting linguistic complexity for Swedish second language learning material on a 5-point scale. After showing that the traditional Swedish readability measure, Läsbarhetsindex (LIX), is not suitable for this task, we propose a supervised machine learning model, based on a range of linguistic features, that can reliably classify texts according to their difficulty level. Our model obtained an accuracy of 81.3% and an F-score of 0.8, which is comparable to the state of the art in English and is considerably higher than previously reported results for other languages. We further studied the utility of our features with single sentences instead of full texts since sentences are a common linguistic unit in language learning exercises. We trained a separate model on sentence-level data with five classes, which yielded 63.4% accuracy. Although this is lower than the document level performance, we achieved an adjacent accuracy of 92%. Furthermore, we found that using a combination of different features, compared to using lexical features alone, resulted in 7% improvement in classification accuracy at the sentence level, whereas at the document level, lexical features were more dominant. Our models are intended for use in a freely accessible web-based language learning platform for the automatic generation of exercises.

* Presented at CICLING 2015 and won the best poster award (16th International Conference on Intelligent Text Processing and Computational Linguistics). To appear in International Journal of Computational Linguistics and Applications (IJLCA), 2016 
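
The abstract above uses LIX as its baseline and reports an "adjacent accuracy". As a minimal sketch: LIX is conventionally computed as the average sentence length in words plus the percentage of long words (more than six letters). The tokenisation and sentence splitting below are simplifying assumptions for illustration, not the authors' preprocessing, and `adjacent_accuracy` encodes the usual reading of the term (a prediction counts as correct if it is at most one level away from the true level), which the abstract itself does not spell out.

```python
import re


def lix(text):
    """LIX = words / sentences + 100 * long_words / words,
    where a long word has more than six letters."""
    # Assumption: a word is a run of letters; a sentence ends at . ! or ?
    words = re.findall(r"[^\W\d_]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def adjacent_accuracy(true_levels, predicted_levels):
    """Share of predictions that are within one level of the true level."""
    if len(true_levels) != len(predicted_levels) or not true_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    close = sum(abs(t - p) <= 1 for t, p in zip(true_levels, predicted_levels))
    return close / len(true_levels)


if __name__ == "__main__":
    sample = "Jag läser en bok. Biblioteket ligger nära universitetet."
    print(lix(sample))  # 8 words, 2 sentences, 2 long words
    print(adjacent_accuracy([1, 2, 3, 4, 5], [1, 3, 5, 4, 3]))
```

Because LIX only looks at sentence length and word length, it cannot see vocabulary or syntax, which is consistent with the abstract's finding that a feature-based classifier is needed for this task.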


Multi-Ontology Refined Embeddings (MORE): A Hybrid Multi-Ontology and Corpus-based Semantic Representation for Biomedical Concepts

Apr 14, 2020
Steven Jiang, Weiyi Wu, Naofumi Tomita, Craig Ganoe, Saeed Hassanpour

Objective: Currently, a major limitation for natural language processing (NLP) analyses in clinical applications is that a concept can be referenced in various forms across different texts. This paper introduces Multi-Ontology Refined Embeddings (MORE), a novel hybrid framework for incorporating domain knowledge from multiple ontologies into a distributional semantic model, learned from a corpus of clinical text. Materials and Methods: We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE. For the ontology-based part, we use the Medical Subject Headings (MeSH) ontology and three state-of-the-art ontology-based similarity measures. In our approach, we propose a new learning objective, modified from the Sigmoid cross-entropy objective function. Results and Discussion: We evaluate the quality of the generated word embeddings using two established datasets of semantic similarities among biomedical concept pairs. On the first dataset with 29 concept pairs, with the similarity scores established by physicians and medical coders, MORE's similarity scores have the highest combined correlation (0.633), which is 5.0% higher than that of the baseline model and 12.4% higher than that of the best ontology-based similarity measure. On the second dataset with 449 concept pairs, MORE's similarity scores have a correlation of 0.481 with the average of four medical residents' similarity ratings, and that outperforms the skip-gram model by 8.1% and the best ontology measure by 6.9%.


Copy-Enhanced Heterogeneous Information Learning for Dialogue State Tracking

Aug 21, 2019
Qingbin Liu, Shizhu He, Kang Liu, Shengping Liu, Jun Zhao

Dialogue state tracking (DST) is an essential component in task-oriented dialogue systems, which estimates user goals at every dialogue turn. However, most previous approaches suffer from the following problems. Many discriminative models, especially end-to-end (E2E) models, have difficulty extracting unknown values that are not in the candidate ontology; previous generative models, which can extract unknown values from utterances, perform worse because they ignore the semantic information of the pre-defined ontology. Besides, previous generative models usually need a hand-crafted list to normalize the generated values. How to integrate the semantic information of the pre-defined ontology and the dialogue text (heterogeneous texts) to generate unknown values and improve performance has become a severe challenge. In this paper, we propose a Copy-Enhanced Heterogeneous Information Learning model with multiple encoder-decoder for DST (CEDST), which can effectively generate all possible values including unknown values by copying values from heterogeneous texts. Meanwhile, CEDST can effectively decompose the large state space into several small state spaces through multi-encoder, and employ multi-decoder to make full use of the reduced spaces to generate values. The multi-encoder-decoder architecture can significantly improve performance. Experiments show that CEDST can achieve state-of-the-art results on two datasets and our constructed datasets with many unknown values.

* 12 pages, 4 figures 


SAUCE: Truncated Sparse Document Signature Bit-Vectors for Fast Web-Scale Corpus Expansion

Aug 26, 2021
Muntasir Wahed, Daniel Gruhl, Alfredo Alba, Anna Lisa Gentile, Petar Ristoski, Chad Deluca, Steve Welch, Ismini Lourentzou

Recent advances in text representation have shown that training on large amounts of text is crucial for natural language understanding. However, models trained without predefined notions of topical interest typically require careful fine-tuning when transferred to specialized domains. When a sufficient amount of within-domain text may not be available, expanding a seed corpus of relevant documents from large-scale web data poses several challenges. First, corpus expansion requires scoring and ranking each document in the collection, an operation that can quickly become computationally expensive as the web corpora size grows. Relying on dense vector spaces and pairwise similarity adds to the computational expense. Secondly, as the domain concept becomes more nuanced, capturing the long tail of domain-specific rare terms becomes non-trivial, especially under limited seed corpora scenarios. In this paper, we consider the problem of fast approximate corpus expansion given a small seed corpus with a few relevant documents as a query, with the goal of capturing the long tail of a domain-specific set of concept terms. To efficiently collect large-scale domain-specific corpora with limited relevance feedback, we propose a novel truncated sparse document bit-vector representation, termed Signature Assisted Unsupervised Corpus Expansion (SAUCE). Experimental results show that SAUCE can reduce the computational burden while ensuring high within-domain lexical coverage.

* Accepted to CIKM'21 Applied Research Track 


Automatic Fairness Testing of Neural Classifiers through Adversarial Sampling

Jul 17, 2021
Peixin Zhang, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, Jin Song Dong

Although deep learning has demonstrated astonishing performance in many applications, there are still concerns about its dependability. One desirable property of deep learning applications with societal impact is fairness (i.e., non-discrimination). Unfortunately, discrimination might be intrinsically embedded into the models due to discrimination in the training data. As a countermeasure, fairness testing systematically identifies discriminative samples, which can be used to retrain the model and improve its fairness. Existing fairness testing approaches, however, have two major limitations. First, they only work well on traditional machine learning models and have poor performance (e.g., effectiveness and efficiency) on deep learning models. Second, they only work on simple tabular data and are not applicable to domains such as text. In this work, we bridge the gap by proposing a scalable and effective approach for systematically searching for discriminative samples while extending fairness testing to address a challenging domain, i.e., text classification. Compared with state-of-the-art methods, our approach only employs lightweight procedures like gradient computation and clustering, which makes it significantly more scalable. Experimental results show that on average, our approach explores the search space more effectively (9.62 and 2.38 times more than the state-of-the-art methods respectively on tabular and text datasets) and generates many more individual discriminatory instances (24.95 and 2.68 times) within reasonable time. The retrained models reduce discrimination by 57.2% and 60.2% respectively on average.
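
The abstract above counts "individual discriminatory instances" without defining them. A minimal sketch of the check commonly used in the fairness testing literature follows: an input is flagged if changing only its protected attribute changes the model's prediction. This illustrates the property being tested, not the authors' gradient- and clustering-based search. The helper names, the `gender` and `income` fields, and the deliberately biased toy model are all hypothetical.

```python
def is_discriminatory(predict, sample, protected_key, protected_values):
    """Return True if altering only the protected attribute flips the prediction."""
    base = predict(sample)
    for value in protected_values:
        if value == sample[protected_key]:
            continue
        # Copy the sample and change nothing but the protected attribute.
        variant = dict(sample)
        variant[protected_key] = value
        if predict(variant) != base:
            return True
    return False


def toy_predict(sample):
    """Hypothetical biased classifier: adds a bonus for one gender value."""
    score = sample["income"] / 1000 + (5 if sample["gender"] == "male" else 0)
    return int(score >= 50)


if __name__ == "__main__":
    genders = ["female", "male"]
    near_boundary = {"income": 47000, "gender": "female"}
    far_from_boundary = {"income": 60000, "gender": "female"}
    print(is_discriminatory(toy_predict, near_boundary, "gender", genders))
    print(is_discriminatory(toy_predict, far_from_boundary, "gender", genders))
```

In this toy setting only inputs near the decision boundary are flagged, which hints at why a guided search can find such instances faster than random sampling.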
