Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
We investigate multi-task learning approaches that use a shared feature representation for all tasks. To better understand the transfer of task information, we study an architecture with a shared module for all tasks and a separate output module for each task. We study the theory of this setting on linear and ReLU-activated models. Our key observation is that whether or not tasks' data are well-aligned can significantly affect the performance of multi-task learning. We show that misalignment between task data can cause negative transfer (or hurt performance) and provide sufficient conditions for positive transfer. Inspired by the theoretical insights, we show that aligning tasks' embedding layers leads to performance gains for multi-task training and transfer learning on the GLUE benchmark and sentiment analysis tasks; for example, we obtain a 2.35% GLUE score average improvement on 5 GLUE tasks over BERT-LARGE using our alignment method. We also design an SVD-based task reweighting scheme and show that it improves the robustness of multi-task training on a multi-label image dataset.
We study open domain response generation with limited message-response pairs. The problem exists in real-world applications but is less explored by the existing work. Since the paired data now is no longer enough to train a neural generation model, we consider leveraging the large scale of unpaired data that are much easier to obtain, and propose response generation with both paired and unpaired data. The generation model is defined by an encoder-decoder architecture with templates as prior, where the templates are estimated from the unpaired data as a neural hidden semi-markov model. By this means, response generation learned from the small paired data can be aided by the semantic and syntactic knowledge in the large unpaired data. To balance the effect of the prior and the input message to response generation, we propose learning the whole generation model with an adversarial approach. Empirical studies on question response generation and sentiment response generation indicate that when only a few pairs are available, our model can significantly outperform several state-of-the-art response generation models in terms of both automatic and human evaluation.
Typical methods for supervised sequence modeling are built upon the recurrent neural networks to capture temporal dependencies. One potential limitation of these methods is that they only model explicitly information interactions between adjacent time steps in a sequence, hence the high-order interactions between nonadjacent time steps are not fully exploited. It greatly limits the capability of modeling the long-range temporal dependencies since one-order interactions cannot be maintained for a long term due to information dilution and gradient vanishing. To tackle this limitation, we propose the Non-local Recurrent Neural Memory (NRNM) for supervised sequence modeling, which performs non-local operations to learn full-order interactions within a sliding temporal block and models global interactions between blocks in a gated recurrent manner. Consequently, our model is able to capture the long-range dependencies. Besides, the latent high-level features contained in high-order interactions can be distilled by our model. We demonstrate the merits of our NRNM on two different tasks: action recognition and sentiment analysis.
Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases where the conjuncts consist of the components representing consonant(s) and vowel, various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, there has been little work done on intra-language comparison regarding performances using each representation. In this study, utilizing the Korean language which is character-rich and agglutinative, we investigate an encoding scheme that is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of intention types. The result displays that the character-level features show higher performance in general, although the Jamo-level features may show compatibility with the attention-based models if guaranteed adequate parameter size.
Existing controllable text generation systems rely on annotated attributes, which greatly limits their capabilities and applications. In this work, we make the first successful attempt to use VAEs to achieve controllable text generation without supervision. We do so by decomposing the latent space of the VAE into two parts: one incorporates structural constraints to capture dominant global variations implicitly present in the data, e.g., sentiment or topic; the other is unstructured and is used for the reconstruction of the source sentences. With the enforced structural constraint, the underlying global variations will be discovered and disentangled during the training of the VAE. The structural constraint also provides a natural recipe for mitigating posterior collapse for the structured part, which cannot be fully resolved by the existing techniques. On the task of text style transfer, our unsupervised approach achieves significantly better performance than previous supervised approaches. By showcasing generation with finer-grained control including Cards-Against-Humanity-style topic transitions within a sentence, we demonstrate that our model can perform controlled text generation in a more flexible way than existing methods.
In this article, we present the first in depth linguistic study of human feelings. While there has been substantial research on incorporating some affective categories into linguistic analysis (e.g. sentiment, and to a lesser extent, emotion), the more diverse category of human feelings has thus far not been investigated. We surveyed the extensive interdisciplinary literature around feelings to construct a working definition of what constitutes a feeling and propose 9 broad categories of feeling. We identified potential feeling words based on their pointwise mutual information with morphological variants of the word `feel' in the Google n-gram corpus, and present a manual annotation exercise where 317 WordNet senses of one hundred of these words were categorised as `not a feeling' or as one of the 9 proposed categories of feeling. We then proceeded to annotate 11386 WordNet senses of all these words to create WordNet-feelings, a new affective dataset that identifies 3664 word senses as feelings, and associates each of these with one of the 9 categories of feeling. WordNet-feelings can be used in conjunction with other datasets such as SentiWordNet that annotate word senses with complementary affective properties such as valence and intensity.
Automatically recognized terminology is widely used for various domain-specific texts processing tasks, such as machine translation, information retrieval or sentiment analysis. However, there is still no agreement on which methods are best suited for particular settings and, moreover, there is no reliable comparison of already developed methods. We believe that one of the main reasons is the lack of state-of-the-art methods implementations, which are usually non-trivial to recreate. In order to address these issues, we present ATR4S, an open-source software written in Scala that comprises more than 15 methods for automatic terminology recognition (ATR) and implements the whole pipeline from text document preprocessing, to term candidates collection, term candidates scoring, and finally, term candidates ranking. It is highly scalable, modular and configurable tool with support of automatic caching. We also compare 10 state-of-the-art methods on 7 open datasets by average precision and processing time. Experimental comparison reveals that no single method demonstrates best average precision for all datasets and that other available tools for ATR do not contain the best methods.
Complaining is a speech act that expresses a negative inconsistency between reality and human expectations. While prior studies mostly focus on identifying the existence or the type of complaints, in this work, we present the first study in computational linguistics of measuring the intensity of complaints from text. Analyzing complaints from such perspective is particularly useful, as complaints of certain degrees may cause severe consequences for companies or organizations. We create the first Chinese dataset containing 3,103 posts about complaints from Weibo, a popular Chinese social media platform. These posts are then annotated with complaints intensity scores using Best-Worst Scaling (BWS) method. We show that complaints intensity can be accurately estimated by computational models with the best mean square error achieving 0.11. Furthermore, we conduct a comprehensive linguistic analysis around complaints, including the connections between complaints and sentiment, and a cross-lingual comparison for complaints expressions used by Chinese and English speakers. We finally show that our complaints intensity scores can be incorporated for better estimating the popularity of posts on social media.