Despite the excellent performance of black-box approaches to modeling sentiment and emotion, lexica (sets of informative words and associated weights) that characterize different emotions are indispensable to the NLP community because they allow for interpretable and robust predictions. Emotion analysis of text is increasing in popularity in NLP; however, manually creating lexica for psychological constructs such as empathy has proven difficult. This paper automatically creates empathy word ratings from document-level ratings. The underlying problem of learning word ratings from higher-level supervision has to date been addressed only in an ad hoc fashion and lacks deep learning methods. We systematically compare a number of approaches to learning word ratings from higher-level supervision against a Mixed-Level Feed Forward Network (MLFFN), which we find performs best, and use the MLFFN to create the first-ever empathy lexicon. We then use Signed Spectral Clustering to gain insights into the resulting words.
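Below is a minimal sketch of the general recipe behind learning word ratings from document-level supervision: train a regressor on averaged word embeddings against document ratings, then score each word's embedding on its own. The embedding size, layer widths, toy vocabulary and ratings are all illustrative assumptions, not the MLFFN's actual configuration.

```python
import torch
import torch.nn as nn

# Toy corpus: each document is a bag of words with a document-level empathy rating.
vocab = ["comfort", "help", "tragedy", "stock", "weather"]
word2idx = {w: i for i, w in enumerate(vocab)}
docs = [(["comfort", "help"], 5.0), (["stock", "weather"], 1.0), (["tragedy", "help"], 4.0)]

emb = nn.Embedding(len(vocab), 16)
regressor = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(emb.parameters()) + list(regressor.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):
    for words, rating in docs:
        idx = torch.tensor([word2idx[w] for w in words])
        doc_vec = emb(idx).mean(dim=0)                 # document vector = average of word embeddings
        loss = loss_fn(regressor(doc_vec), torch.tensor([rating]))
        opt.zero_grad(); loss.backward(); opt.step()

# Derive word-level ratings by scoring each word's embedding on its own.
with torch.no_grad():
    lexicon = {w: regressor(emb(torch.tensor(word2idx[w]))).item() for w in vocab}
print(lexicon)
```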
Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion recognition, achieving an accuracy of 69.5% on Task 4A of SemEval 2017 and improving upon the previous state of the art by over 3% absolute. We combine these language models with speech emotion recognition, achieving 73.5% accuracy when using the provided transcriptions and speech data on a four-class subset of the IEMOCAP dataset. Using noise-induced transcriptions and speech data results in an accuracy of 71.4%. For our experiments, we created IEmoNet, a modular and adaptable bimodal framework for speech emotion recognition based on pre-trained language models. Lastly, we discuss the idea of using an emotion classifier as a reward for reinforcement learning as a step towards more successful and convenient human-machine interaction.
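As a rough illustration of the text branch, the snippet below fine-tunes a pre-trained language model for emotion classification with the Hugging Face transformers library. The model name, the four-label set, and the hyperparameters are assumptions; IEmoNet's acoustic branch and fusion are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["angry", "happy", "neutral", "sad"]          # assumed four-class setup
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=len(LABELS))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, label_ids):
    """One fine-tuning step on a batch of transcriptions."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(label_ids))   # cross-entropy computed internally
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()

train_step(["I can't believe you did that!", "That sounds wonderful."], [0, 1])
```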
Variational Autoencoders (VAEs) hold great potential for modelling text, as they could in theory separate high-level semantic and syntactic properties from local regularities of natural language. In practice, however, VAEs with autoregressive decoders often suffer from posterior collapse, a phenomenon where the model learns to ignore the latent variables, causing the sequence VAE to degenerate into a language model. Previous work attempts to solve this problem with complex architectural changes or costly optimization schemes. In this paper, we argue that posterior collapse is caused in part by the encoder network failing to capture the input variabilities. We verify this hypothesis empirically and propose a straightforward fix using pooling. This simple technique effectively prevents posterior collapse, allowing the model to achieve significantly better data log-likelihood than standard sequence VAEs. Compared with the previous state of the art in preventing posterior collapse, our approach achieves comparable performance while being significantly faster.
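A minimal sketch of the pooling fix, assuming a standard LSTM encoder: instead of summarizing the input with only the final hidden state, the encoder pools over all hidden states before computing the approximate posterior, so every token influences the latent code. The dimensions and the choice of max-pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PooledEncoder(nn.Module):
    """Sequence VAE encoder that pools over all hidden states rather than
    keeping only the final state. Sizes are illustrative assumptions."""
    def __init__(self, vocab_size=1000, emb_dim=128, hid_dim=256, latent_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(hid_dim, latent_dim)
        self.to_logvar = nn.Linear(hid_dim, latent_dim)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))              # (batch, seq_len, hid_dim)
        pooled = h.max(dim=1).values                   # max-pool over time: sensitive to every token
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar

enc = PooledEncoder()
z, mu, logvar = enc(torch.randint(0, 1000, (4, 20)))   # batch of 4 sequences of length 20
```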
Given a natural language phrase, relation linking aims to find a relation (predicate or property) from the underlying knowledge graph that matches the phrase. It is very useful in many applications, such as natural language question answering, personalized recommendation and text summarization. However, previous relation linking algorithms usually produce a single relation for the input phrase and pay little attention to a more general and challenging problem, namely combinational relation linking, which extracts a subgraph pattern to match a compound phrase (e.g., mother-in-law). In this paper, we focus on the task of combinational relation linking over knowledge graphs. To resolve the problem, we design a systematic method based on a data-driven relation assembly technique, which is performed under the guidance of meta patterns. We also introduce external knowledge to enhance the system's understanding ability. Finally, we conduct extensive experiments over a real knowledge graph to study the performance of the proposed method.
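The toy sketch below only illustrates the target output of combinational relation linking: a subgraph pattern, rather than a single relation, matched to a compound phrase. The hard-coded decomposition dictionary stands in for external knowledge, and the single chain-shaped meta pattern is a simplifying assumption; the paper's data-driven relation assembly is not reproduced here.

```python
# Hypothetical external-knowledge lookup that decomposes a compound phrase
# into component relations; a real system would derive this from data.
EXTERNAL_DECOMPOSITION = {
    "mother-in-law": ["spouse", "mother"],
    "grandfather": ["parent", "father"],
}
CHAIN_META_PATTERN = "?x --{r1}--> ?m --{r2}--> ?y"   # two relations joined at an intermediate variable

def link_compound(phrase):
    """Assemble a two-relation chain pattern for a compound phrase."""
    r1, r2 = EXTERNAL_DECOMPOSITION[phrase]
    return CHAIN_META_PATTERN.format(r1=r1, r2=r2)

print(link_compound("mother-in-law"))   # ?x --spouse--> ?m --mother--> ?y
```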
Neural models have achieved considerable improvements for many natural language processing tasks, but they offer little transparency, and interpretability comes at a cost. In some domains, automated predictions without justifications have limited applicability. Recently, progress has been made regarding single-aspect sentiment analysis for reviews, where the ambiguity of a justification is minimal. In this context, a justification, or mask, consists of (long) word sequences from the input text that suffice to make the prediction. Existing models cannot handle more than one aspect in a single training run and induce binary masks that may be ambiguous. In our work, we propose a neural model that predicts multi-aspect sentiments for reviews and simultaneously generates a probabilistic multi-dimensional mask (one per aspect), in an unsupervised, multi-task learning manner. Our evaluation on three datasets in the beer and hotel domains shows that our model outperforms strong baselines and generates masks that are strong feature predictors, meaningful, and interpretable.
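A sketch of the core idea, under simplifying assumptions: one soft (probabilistic) mask per aspect is computed over the tokens of a review, and each aspect's sentiment is predicted from its own masked summary, so all aspects are trained jointly in one model. The architecture sizes and the plain soft-attention masks are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiAspectMasker(nn.Module):
    """Multi-aspect sentiment prediction with one probabilistic mask per aspect."""
    def __init__(self, vocab_size=5000, emb_dim=100, n_aspects=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.mask_scorer = nn.Linear(emb_dim, n_aspects)      # one relevance score per token per aspect
        self.sentiment_heads = nn.ModuleList(nn.Linear(emb_dim, 1) for _ in range(n_aspects))

    def forward(self, tokens):
        x = self.emb(tokens)                                  # (batch, seq, emb)
        masks = torch.softmax(self.mask_scorer(x), dim=1)     # (batch, seq, n_aspects), sums to 1 over tokens
        preds = []
        for a, head in enumerate(self.sentiment_heads):
            ctx = (masks[..., a:a + 1] * x).sum(dim=1)        # aspect-specific summary of the review
            preds.append(torch.sigmoid(head(ctx)))            # sentiment score for aspect a
        return torch.cat(preds, dim=-1), masks                # predictions plus interpretable masks

model = MultiAspectMasker()
scores, masks = model(torch.randint(0, 5000, (2, 30)))        # 2 reviews, 30 tokens each
```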
Commonsense and background knowledge are required for a QA model to answer many nontrivial questions. Unlike existing work on knowledge-aware QA, we focus on the more challenging task of leveraging external knowledge to generate natural language answers for a given question with context. In this paper, we propose a new neural model, the Knowledge-Enriched Answer Generator (KEAG), which is able to compose a natural answer by exploiting and aggregating evidence from all four available information sources: question, passage, vocabulary and knowledge. During answer generation, KEAG adaptively determines when to utilize symbolic knowledge and which fact from that knowledge is useful. This allows the model to exploit external knowledge that is not explicitly stated in the given text but that is relevant for generating an answer. An empirical study on a public answer-generation benchmark demonstrates that KEAG improves answer quality over models without knowledge and over existing knowledge-aware models, confirming its effectiveness in leveraging knowledge.
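The snippet below sketches only the source-aggregation step: at each decoding step, a learned gate mixes word distributions coming from the question, passage, vocabulary and knowledge sources. The simple softmax gate and the dimensions are assumptions and do not reproduce KEAG's full generator.

```python
import torch
import torch.nn as nn

class SourceSelector(nn.Module):
    """Mix per-source word distributions with a gate conditioned on the decoder state."""
    def __init__(self, state_dim=256, n_sources=4):
        super().__init__()
        self.gate = nn.Linear(state_dim, n_sources)

    def forward(self, decoder_state, source_dists):
        # source_dists: (batch, n_sources, vocab) word distributions from each source
        weights = torch.softmax(self.gate(decoder_state), dim=-1)   # (batch, n_sources)
        return (weights.unsqueeze(-1) * source_dists).sum(dim=1)    # mixed distribution over words

sel = SourceSelector()
mixed = sel(torch.randn(2, 256), torch.softmax(torch.randn(2, 4, 1000), dim=-1))
```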
Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly match object names. Instead, we show how to squeeze the most information out of these captions by training a text-only classifier that generalizes beyond dataset boundaries. Our discovery provides an opportunity for learning detection models from noisy but more abundant and freely-available caption data. We also validate our model on three classic object detection benchmarks and achieve state-of-the-art WSOD performance. Our code is available at https://github.com/yekeren/Cap2Det.
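As a rough sketch of the caption-to-label idea, the snippet below trains a text-only multi-label classifier that maps free-form captions to object categories, which can then serve as pseudo image-level labels for WSOD even when a caption never states a class name verbatim. The tiny training set and the bag-of-words model are illustrative assumptions, not the Cap2Det pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

captions = ["a man rides his bike down the street",
            "two puppies playing on the grass",
            "a cyclist waits at the light",
            "a small dog chasing a ball"]
labels = [["bicycle", "person"], ["dog"], ["bicycle", "person"], ["dog"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
clf = make_pipeline(CountVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(captions, y)

# Score an unseen caption; with such a tiny toy set the probabilities are only illustrative.
probs = clf.predict_proba(["a cyclist pedaling along the road"])
print(dict(zip(mlb.classes_, probs[0].round(2))))
```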
Scripts are an important part of any TV series. They narrate the movements, actions and expressions of characters. In this paper, a case study is presented on how different sequence-to-sequence deep learning models perform at generating new conversations between characters, as well as new scenarios, on the basis of a script (previous conversations). A comprehensive comparison between these models, namely LSTM, GRU and Bidirectional RNN, is presented. All the models are designed to learn the sequence of recurring characters from the input sequence. Each input sequence contains, say, "n" characters, and the corresponding target contains the same number of characters, shifted one character to the right. In this manner, input and output sequences are generated and used to train the models. A closer analysis of the explored models' performance and efficiency is presented with the help of graph plots and texts generated from a given input string. These plots describe both the intra-model and inter-model performance for each model.
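The data construction described above is easy to make concrete. The sketch below builds fixed-length character sequences whose targets are shifted one character to the right and feeds them to a small character-level LSTM; the placeholder script text, the sequence length and the layer sizes are assumptions, and nn.GRU or a bidirectional RNN could be swapped in for comparison.

```python
import torch
import torch.nn as nn

script = "JERRY: Hello, Newman. NEWMAN: Hello, Jerry."   # placeholder standing in for a real script
chars = sorted(set(script))
c2i = {c: i for i, c in enumerate(chars)}

def make_pairs(text, n=10):
    """Build (input, target) index sequences of n characters, the target shifted one character right."""
    xs, ys = [], []
    for i in range(len(text) - n):
        xs.append([c2i[c] for c in text[i:i + n]])
        ys.append([c2i[c] for c in text[i + 1:i + n + 1]])
    return torch.tensor(xs), torch.tensor(ys)

class CharLSTM(nn.Module):
    def __init__(self, vocab, emb=32, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)    # swap in nn.GRU or a bidirectional RNN to compare
        self.out = nn.Linear(hid, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)                                # next-character logits at every position

X, Y = make_pairs(script)
model = CharLSTM(len(chars))
loss = nn.CrossEntropyLoss()(model(X).reshape(-1, len(chars)), Y.reshape(-1))
```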
We address the problem of video moment localization with natural language, i.e., localizing a video segment described by a natural language sentence. While most prior work focuses on grounding the query as a whole, temporal dependencies and reasoning between events within the text are not fully considered. In this paper, we propose a novel Temporal Compositional Modular Network (TCMN), in which a tree attention network first automatically decomposes a sentence into three descriptions with respect to the main event, the context event and the temporal signal. Two modules are then used to measure the visual similarity and the location similarity between each segment and the decomposed descriptions. Moreover, since the main event and the context event may rely on different modalities (RGB or optical flow), we use late fusion to form an ensemble of four models, where each model is independently trained on one combination of the visual inputs. Experiments show that our model outperforms the state-of-the-art methods on the TEMPO dataset.
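The ensembling step can be sketched as follows: four hypothetical models, one per (main-event, context-event) modality combination of RGB and optical flow, each score the candidate segments, and late fusion combines their scores. The simple averaging rule is an assumption about how the fusion might be done.

```python
import torch

def late_fuse(segment_scores):
    """Late fusion sketch: average moment scores from independently trained models."""
    return torch.stack(segment_scores).mean(dim=0)

# Scores over 8 candidate segments from four hypothetical models,
# one per modality combination for (main event, context event).
scores = [torch.rand(8) for _ in ("rgb_rgb", "rgb_flow", "flow_rgb", "flow_flow")]
fused = late_fuse(scores)
best_segment = fused.argmax().item()
```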
Humans understand language based on rich background knowledge about how the physical world works, which in turn allows us to reason about the physical world through language. In addition to the properties of objects (e.g., boats require fuel) and their affordances, i.e., the actions that are applicable to them (e.g., boats can be driven), we can also reason about if-then inferences between the properties of objects and the actions that are applicable to them (e.g., if we can drive something, then it likely requires fuel). In this paper, we investigate the extent to which state-of-the-art neural language representations, trained on a vast amount of natural language text, demonstrate physical commonsense reasoning. While recent advances in neural language models have demonstrated strong performance on various types of natural language inference tasks, our study based on a dataset of over 200k newly collected annotations suggests that neural language representations still only learn associations that are explicitly written down.
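For a quick, informal illustration of probing a pre-trained language model for this kind of knowledge, one can pose cloze-style queries, as sketched below. This is only a sketch of the general probing idea, not the paper's 200k-annotation evaluation protocol; the model choice and prompts are assumptions.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Property/affordance cloze queries; top predictions hint at what the model has memorized.
for prompt in ["A boat requires [MASK] to run.",
               "If something can be driven, it probably needs [MASK]."]:
    top = fill(prompt)[:3]
    print(prompt, "->", [t["token_str"] for t in top])
```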