We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
Computer systems need to be able to react to stress in order to perform optimally on some tasks. This article describes TensiStrength, a system to detect the strength of stress and relaxation expressed in social media text messages. TensiStrength uses a lexical approach and a set of rules to detect direct and indirect expressions of stress or relaxation, particularly in the context of transportation. It is slightly more effective than a comparable sentiment analysis program, although their similar performances occur despite differences on almost half of the tweets gathered. The effectiveness of TensiStrength depends on the nature of the tweets classified, with tweets that are rich in stress-related terms being particularly problematic. Although generic machine learning methods can give better performance than TensiStrength overall, they exploit topic-related terms in a way that may be undesirable in practical applications and that may not work as well in more focused contexts. In conclusion, TensiStrength and generic machine learning approaches work well enough to be practical choices for intelligent applications that need to take advantage of stress information, and the decision about which to use depends on the nature of the texts analysed and the purpose of the task.
A key aspect of word of mouth marketing are emotions. Emotions in texts help propagating messages in conventional advertising. In word of mouth scenarios, emotions help to engage consumers and incite to propagate the message further. While the function of emotions in offline marketing in general and word of mouth marketing in particular is rather well understood, online marketing can only offer a limited view on the function of emotions. In this contribution we seek to close this gap. We therefore investigate how emotions function in social media. To do so, we collected more than 30,000 brand marketing messages from the Google+ social networking site. Using state of the art computational linguistics classifiers, we compute the sentiment of these messages. Starting out with Poisson regression-based baseline models, we seek to replicate earlier findings using this large data set. We extend upon earlier research by computing multi-level mixed effects models that compare the function of emotions across different industries. We find that while the well known notion of activating emotions propagating messages holds in general for our data as well. But there are significant differences between the observed industries.
This study looks for signals of economic awareness on online social media and tests their significance in economic predictions. The study analyses, over a period of two years, the relationship between the West Texas Intermediate daily crude oil price and multiple predictors extracted from Twitter, Google Trends, Wikipedia, and the Global Data on Events, Language, and Tone database (GDELT). Semantic analysis is applied to study the sentiment, emotionality and complexity of the language used. Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) models are used to make predictions and to confirm the value of the study variables. Results show that the combined analysis of the four media platforms carries valuable information in making financial forecasting. Twitter language complexity, GDELT number of articles and Wikipedia page reads have the highest predictive power. This study also allows a comparison of the different fore-sighting abilities of each platform, in terms of how many days ahead a platform can predict a price movement before it happens. In comparison with previous work, more media sources and more dimensions of the interaction and of the language used are combined in a joint analysis.
The combination of multilingual pre-trained representations and cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages. However, for extremely low-resource languages without large-scale monolingual corpora for pre-training or sufficient annotated data for fine-tuning, transfer learning remains an under-studied and challenging task. Moreover, recent work shows that multilingual representations are surprisingly disjoint across languages, bringing additional challenges for transfer onto extremely low-resource languages. In this paper, we propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one and brings their representation spaces closer for effective transfer. Extensive experiments on real-world low-resource languages - without access to large-scale monolingual corpora or large amounts of labeled data - for tasks like cross-lingual sentiment analysis and named entity recognition show the effectiveness of our approach. Code for MetaXL is publicly available at github.com/microsoft/MetaXL.
Aspect classification, identifying aspects of text segments, facilitates numerous applications, such as sentiment analysis and review summarization. To alleviate the human effort on annotating massive texts, in this paper, we study the problem of classifying aspects based on only a few user-provided seed words for pre-defined aspects. The major challenge lies in how to handle the noisy misc aspect, which is designed for texts without any pre-defined aspects. Even domain experts have difficulties to nominate seed words for the misc aspect, making existing seed-driven text classification methods not applicable. We propose a novel framework, ARYA, which enables mutual enhancements between pre-defined aspects and the misc aspect via iterative classifier training and seed updating. Specifically, it trains a classifier for pre-defined aspects and then leverages it to induce the supervision for the misc aspect. The prediction results of the misc aspect are later utilized to filter out noisy seed words for pre-defined aspects. Experiments in two domains demonstrate the superior performance of our proposed framework, as well as the necessity and importance of properly modeling the misc aspect.
Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.
In many practical applications, the training data for a machine learning task is partitioned across multiple learners. In many such cases, aggregating all the data onto a single device for model training may not be possible for privacy and communications bandwidth reasons. Existing approaches for learning global models in these settings iteratively aggregate local updates from each learner. These approaches can require many rounds of communication between learners and may produce suboptimal results when the data distribution is extremely non-i.i.d. In this work, we present Good-Enough Model Spaces (GEMS), a framework for learning a global satisficing (i.e. "good-enough") model by considering the space of local learners satisficing models. Our approach does not require sharing of data between learners, makes no assumptions on the distribution of data across learners, and requires minimal communication between learners. We present methods for learning both shallow and deep models within this framework and discuss how hold-out data can be used for post-learning fine-tuning. We find our approaches perform comparably to non-distributed models on experiments in image recognition and sentiment analysis.
Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to black-box attacks, which are more realistic scenarios. In this paper, we present a novel algorithm, DeepWordBug, to effectively generate small text perturbations in a black-box setting that forces a deep-learning classifier to misclassify a text input. We employ novel scoring strategies to identify the critical tokens that, if modified, cause the classifier to make an incorrect prediction. Simple character-level transformations are applied to the highest-ranked tokens in order to minimize the edit distance of the perturbation, yet change the original classification. We evaluated DeepWordBug on eight real-world text datasets, including text classification, sentiment analysis, and spam detection. We compare the result of DeepWordBug with two baselines: Random (Black-box) and Gradient (White-box). Our experimental results indicate that DeepWordBug reduces the prediction accuracy of current state-of-the-art deep-learning models, including a decrease of 68\% on average for a Word-LSTM model and 48\% on average for a Char-CNN model.
Natural Language Generation (NLG) has made great progress in recent years due to the development of deep learning techniques such as pre-trained language models. This advancement has resulted in more fluent, coherent and even properties controllable (e.g. stylistic, sentiment, length etc.) generation, naturally leading to development in downstream tasks such as abstractive summarization, dialogue generation, machine translation, and data-to-text generation. However, the faithfulness problem that the generated text usually contains unfaithful or non-factual information has become the biggest challenge, which makes the performance of text generation unsatisfactory for practical applications in many real-world scenarios. Many studies on analysis, evaluation, and optimization methods for faithfulness problems have been proposed for various tasks, but have not been organized, compared and discussed in a combined manner. In this survey, we provide a systematic overview of the research progress on the faithfulness problem of NLG, including problem analysis, evaluation metrics and optimization methods. We organize the evaluation and optimization methods for different tasks into a unified taxonomy to facilitate comparison and learning across tasks. Several research trends are discussed further.