As humans, we can often detect from a persons utterances if he or she is in favor of or against a given target entity (topic, product, another person, etc). But from the perspective of a computer, we need means to automatically deduce the stance of the tweeter, given just the tweet text. In this paper, we present our results of performing stance detection on twitter data using a supervised approach. We begin by extracting bag-of-words to perform classification using TIMBL, then try and optimize the features to improve stance detection accuracy, followed by extending the dataset with two sets of lexicons - arguing, and MPQA subjectivity; next we explore the MALT parser and construct features using its dependency triples, finally we perform analysis using Scikit-learn Random Forest implementation.
With the advances in both stable interest region detectors and robust and distinctive descriptors, local feature-based image or object retrieval has become a popular research topic. %All of the local feature-based image retrieval system involves two important processes: local feature extraction and image representation. The other key technology for image retrieval systems is image representation such as the bag-of-visual words (BoVW), Fisher vector, or Vector of Locally Aggregated Descriptors (VLAD) framework. In this paper, we review local features and image representations for image retrieval. Because many and many methods are proposed in this area, these methods are grouped into several classes and summarized. In addition, recent deep learning-based approaches for image retrieval are briefly reviewed.
Modelling, simulation and optimization form an integrated part of modern design practice in engineering and industry. Tremendous progress has been observed for all three components over the last few decades. However, many challenging issues remain unresolved, and the current trends tend to use nature-inspired algorithms and surrogate-based techniques for modelling and optimization. This 4th workshop on Computational Optimization, Modelling and Simulation (COMS 2013) at ICCS 2013 will further summarize the latest developments of optimization and modelling and their applications in science, engineering and industry. In this review paper, we will analyse the recent trends in modelling and optimization, and their associated challenges. We will discuss important topics for further research, including parameter-tuning, large-scale problems, and the gaps between theory and applications.
Experimental life sciences like biology or chemistry have seen in the recent decades an explosion of the data available from experiments. Laboratory instruments become more and more complex and report hundreds or thousands measurements for a single experiment and therefore the statistical methods face challenging tasks when dealing with such high dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures making possible this reduction are called dimensionality reduction techniques; they have widely been developed by fields like Statistics or Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimension reduction techniques available and give the mathematical insight behind them.
We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI), showing a precise relationship between its performance and the uniformity of the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando's (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.
In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.
Calibration and uncertainty estimation are crucial topics in high-risk environments. We introduce a new diversity regularizer for classification tasks that uses out-of-distribution samples and increases the overall accuracy, calibration and out-of-distribution detection capabilities of ensembles. Following the recent interest in the diversity of ensembles, we systematically evaluate the viability of explicitly regularizing ensemble diversity to improve calibration on in-distribution data as well as under dataset shift. We demonstrate that diversity regularization is highly beneficial in architectures, where weights are partially shared between the individual members and even allows to use fewer ensemble members to reach the same level of robustness. Experiments on CIFAR-10, CIFAR-100, and SVHN show that regularizing diversity can have a significant impact on calibration and robustness, as well as out-of-distribution detection.
In this paper, we use a large-scale play scripts dataset to propose the novel task of theatrical cue generation from dialogues. Using over one million lines of dialogue and cues, we approach the problem of cue generation as a controlled text generation task, and show how cues can be used to enhance the impact of dialogue using a language model conditioned on a dialogue/cue discriminator. In addition, we explore the use of topic keywords and emotions for controlled text generation. Extensive quantitative and qualitative experiments show that language models can be successfully used to generate plausible and attribute-controlled texts in highly specialised domains such as play scripts. Supporting materials can be found at: https://catlab-team.github.io/cuegen.
We consider the problem of generating adversarial malware by a cyber-attacker where the attacker's task is to strategically modify certain bytes within existing binary malware files, so that the modified files are able to evade a malware detector such as machine learning-based malware classifier. We have evaluated three recent adversarial malware generation techniques using binary malware samples drawn from a single, publicly available malware data set and compared their performances for evading a machine-learning based malware classifier called MalConv. Our results show that among the compared techniques, the most effective technique is the one that strategically modifies bytes in a binary's header. We conclude by discussing the lessons learned and future research directions on the topic of adversarial malware generation.
Inspiration moves a person to see new possibilities and transforms the way they perceive their own potential. Inspiration has received little attention in psychology, and has not been researched before in the NLP community. To the best of our knowledge, this work is the first to study inspiration through machine learning methods. We aim to automatically detect inspiring content from social media data. To this end, we analyze social media posts to tease out what makes a post inspiring and what topics are inspiring. We release a dataset of 5,800 inspiring and 5,800 non-inspiring English-language public post unique ids collected from a dump of Reddit public posts made available by a third party and use linguistic heuristics to automatically detect which social media English-language posts are inspiring.