Community-based question answering (CQA) websites represent an important source of information. As a result, the problem of matching the most valuable answers to their corresponding questions has become an increasingly popular research topic. We frame this task as a binary (relevant/irrelevant) classification problem and propose a Multi-scale Matching model that inspects the correlation between words and ngrams (word-to-ngrams) at different levels of granularity, in addition to the word-to-word correlations used in most prior work. In this way, our model captures the rich context information conveyed in ngrams and can therefore better differentiate good answers from bad ones. Furthermore, we present an adversarial training framework that iteratively generates challenging negative samples to fool the proposed classification model. This differs from previous methods, in which negative samples are uniformly sampled from the dataset during training. The proposed method is evaluated on the SemEval 2017 and Yahoo Answers datasets and achieves state-of-the-art performance.
Wood-composite materials are widely used today because they homogenize humidity-related directional deformations. Quantifying these deformations as coefficients is important for construction and engineering and is a topic of current research, but it is still a manual process. This work introduces a novel computer vision approach that automatically extracts these properties directly from scans of the wooden specimens, taken at different humidity levels during the long-lasting humidity conditioning process. These scans are used to compute a humidity-dependent deformation field for each pixel, from which the desired coefficients can easily be calculated. The overall method includes automated registration of the wooden blocks and numerical optimization to compute a variational optical flow field, which is then used to calculate dense strain fields and, finally, the engineering coefficients and their variance throughout the wooden blocks. The method's regularization is fully parameterizable, which makes it possible to model and suppress artifacts caused by changes in the surface appearance of the specimens (mold, cracks, etc.) that typically arise during the conditioning process.
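As a rough illustration of the step from a dense displacement (flow) field to dense strain fields, the sketch below computes the standard small-strain components with finite differences. It assumes the flow is already available as two arrays of per-pixel displacements; the function name `strain_fields` and the `spacing` parameter are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def strain_fields(ux, uy, spacing=1.0):
    """Small-strain components from a dense 2D displacement field.

    ux, uy  : arrays of shape (H, W) with per-pixel displacement in x and y
    spacing : pixel size in physical units (assumed equal in both directions)
    """
    # np.gradient returns derivatives along axis 0 (y) and axis 1 (x), in that order.
    dux_dy, dux_dx = np.gradient(ux, spacing)
    duy_dy, duy_dx = np.gradient(uy, spacing)
    eps_xx = dux_dx                      # normal strain along x
    eps_yy = duy_dy                      # normal strain along y
    eps_xy = 0.5 * (dux_dy + duy_dx)     # shear strain
    return eps_xx, eps_yy, eps_xy
```

Averaging `eps_xx` and `eps_yy` over a block and dividing by the change in humidity or moisture content between two scans would give one possible per-direction coefficient; the exact definition used in the paper is not stated in the abstract.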
Artificial Intelligence is a central topic in the computer science curriculum. Since 2011, a project-based learning methodology based on computer games has been designed and implemented in the artificial intelligence course at the University of the Bio-Bio. The project aims to develop software-controlled agents (bots) that are programmed using heuristic algorithms seen during the course. This methodology allows us to obtain good learning results; however, several challenges have been found during its implementation. In this paper we show how linguistic descriptions of data can help to provide students and teachers with technical and personalized feedback about the learned algorithms. An algorithm behavior profile and a new Turing test for computer game bots, both based on linguistic modelling of complex phenomena, are also proposed in order to deal with such challenges. To show and explore the possibilities of this new technology, a web platform has been designed and implemented by one of the authors, and its incorporation into the assessment process allows us to improve the teaching-learning process.
Autoencoders have been successful in learning meaningful representations from image datasets. However, their performance on text datasets has not been widely studied. Traditional autoencoders tend to learn possibly trivial representations of text documents because of confounding properties of text such as high dimensionality, sparsity, and power-law word distributions. In this paper, we propose a novel k-competitive autoencoder, called KATE, for text documents. Due to the competition between the neurons in the hidden layer, each neuron becomes specialized in recognizing specific data patterns, and overall the model can learn meaningful representations of textual data. A comprehensive set of experiments shows that KATE can learn better representations than traditional autoencoders, including denoising, contractive, variational, and k-sparse autoencoders. Our model also outperforms deep generative models, probabilistic topic models, and even word representation models (e.g., Word2Vec) on several downstream tasks such as document classification, regression, and retrieval.
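The competition between hidden neurons can be pictured with a simplified winner-take-all step, sketched below. This is an illustrative assumption rather than KATE's exact layer: the abstract does not specify how winners are chosen or what happens to the losers, so the magnitude-based selection, the `alpha` reallocation, and the name `k_competitive` are all hypothetical.

```python
import numpy as np

def k_competitive(z, k, alpha=0.0):
    """Keep the k hidden activations with the largest magnitude; zero the rest.

    z     : 1-D array of hidden activations for one document
    k     : number of winning neurons
    alpha : hypothetical factor; if > 0, the losers' total magnitude is
            shared equally among the winners (keeping each winner's sign)
    """
    z = np.asarray(z, dtype=float)
    if not 0 < k <= z.size:
        raise ValueError("k must be between 1 and the number of neurons")
    # Winners are the k neurons with the largest absolute activation.
    winners = np.argsort(-np.abs(z), kind="stable")[:k]
    mask = np.zeros(z.size, dtype=bool)
    mask[winners] = True
    out = np.where(mask, z, 0.0)
    if alpha > 0.0:
        lost = np.abs(z[~mask]).sum()
        out[mask] += alpha * lost / k * np.sign(z[mask])
    return out
```

Because only the winners receive gradient during training, each neuron is pushed to respond to a distinct subset of documents, which matches the specialization effect described above.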
In this work we address the issue of generic automated disease incidence monitoring on Twitter. We employ an ontology of disease-related concepts and use it to obtain a conceptual representation of tweets. Unlike previous keyword-based systems and topic modeling approaches, our ontological approach allows us to apply more stringent criteria for determining which messages are relevant, such as spatial and temporal characteristics, whilst giving a stronger guarantee that the resulting models will perform well on new data that may be lexically divergent. We achieve this by training learners on concepts rather than individual words. For training we use a dataset containing mentions of influenza and Listeria, and we use the learned models to classify datasets containing mentions of an arbitrary selection of other diseases. We show that our ontological approach achieves good performance on this task using a variety of natural language processing techniques. We also show that word vectors can be learned directly from our concepts to achieve even better results.
The field of Machine Learning, and the topic of clustering within it, is still widely researched. Recently, researchers have become interested in a new variant of hierarchical clustering, in which hierarchical (partial order) relationships exist not only between clusters but also between objects. In this variant of clustering, objects can be assigned not only to leaves, but other properties are also defined. Although examples of this approach already exist in the literature, the authors have encountered a problem with the analysis and comparison of the obtained results. The problem is twofold. Firstly, there is a lack of evaluation methods. Secondly, there is a lack of available benchmark data, or at least the authors failed to find any. The aim of this work is to fill the second gap. The main contribution of this paper is a new method of generating hierarchical structures of data. Additionally, the paper includes a theoretical analysis of the generation parameters and their influence on the results. Comprehensive experiments are presented and discussed. The dataset generator and visualiser tools developed are publicly available for use (http://kio.pwr.edu.pl/?page_id=396).
Identifying and communicating relationships between causes and effects is important for understanding our world, but is affected by language structure, cognitive and emotional biases, and the properties of the communication medium. Despite the increasing importance of social media, much remains unknown about causal statements made online. To study real-world causal attribution, we extract a large-scale corpus of causal statements made on the Twitter social network platform as well as a comparable random control corpus. We compare causal and control statements using statistical language and sentiment analysis tools. We find that causal statements have a number of significant lexical and grammatical differences compared with controls and tend to be more negative in sentiment than controls. Causal statements made online tend to focus on news and current events, medicine and health, or interpersonal relationships, as shown by topic models. By quantifying the features and potential biases of causality communication, this study improves our understanding of the accuracy of information and opinions found online.
TurboTax AnswerXchange is a social Q&A system supporting users working on federal and state tax returns. Using 2015 data, we demonstrate that content popularity (the number of views per AnswerXchange question) can be predicted with reasonable accuracy based on attributes of the question alone. We also employ probabilistic topic analysis and uplift modeling to identify the question features with the highest impact on popularity. We demonstrate that content popularity is driven by behavioral attributes of AnswerXchange users and depends on complex interactions between search ranking algorithms, psycholinguistic factors, and question writing style. Our findings provide a rationale for employing popularity predictions to guide users in formulating better questions and editing existing ones. For example, starting the question title with a question word or adding details to the question increases the number of views per question. A similar approach can be applied to promoting AnswerXchange content indexed by Google to drive organic traffic to TurboTax.
Normalized graph cut (NGC) has become a popular research topic due to its wide range of applications in areas such as machine learning and very large scale integration (VLSI) circuit design. Most traditional NGC methods are based on pairwise relationships (similarities). However, in real-world applications, relationships among the vertices (objects) may be more complex than pairwise, and such relationships are typically represented as hyperedges in hypergraphs. Thus, normalized hypergraph cut (NHC) has attracted more and more attention. Existing NHC methods cannot achieve satisfactory performance in real applications. In this paper, we propose a novel relaxation approach, called relaxed NHC (RNHC), to solve the NHC problem. Our model is defined as an optimization problem on the Stiefel manifold. To solve this problem, we resort to the Cayley transformation to devise a feasible learning algorithm. Experimental results on a set of large hypergraph benchmarks for clustering and partitioning in the VLSI domain show that RNHC can outperform state-of-the-art methods.
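To make the role of the Cayley transformation concrete, the sketch below shows a generic feasible update on the Stiefel manifold (matrices X with X^T X = I). It is a minimal illustration of the general technique, not the RNHC algorithm itself: the function name `cayley_step`, the step size `tau`, and the trace objective mentioned in the note are assumptions made for the example.

```python
import numpy as np

def cayley_step(X, G, tau):
    """One Cayley-transform update that keeps X on the Stiefel manifold.

    X   : (n, p) matrix with orthonormal columns (X.T @ X = I)
    G   : (n, p) Euclidean gradient of the objective at X
    tau : step size (hypothetical parameter, chosen by the caller)
    """
    n = X.shape[0]
    W = G @ X.T - X @ G.T                  # skew-symmetric: W.T == -W
    I = np.eye(n)
    # X_new = (I + tau/2 W)^{-1} (I - tau/2 W) X; the Cayley factor is
    # orthogonal, so the columns of X_new remain orthonormal.
    return np.linalg.solve(I + 0.5 * tau * W, (I - 0.5 * tau * W) @ X)
```

For instance, with an objective of the form trace(X^T L X) for a symmetric matrix L, the gradient is G = 2 L X, and repeated calls with a small `tau` decrease the objective without any re-orthogonalization step.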
Rating prediction is a basic problem in recommender systems, and one of the most widely used methods is Factorization Machines (FM). However, traditional matrix factorization methods fail to exploit implicit feedback, which has been shown to be important in the rating prediction problem. In this work, we consider a specific situation, movie rating prediction, where we assume that a user's watching history has a big influence on his or her rating behavior on an item. We introduce two models, Latent Dirichlet Allocation (LDA) and word2vec, both of which achieve state-of-the-art results in training latent features. Based on these, we propose two feature-based models. One is the Topic-based FM Model, which provides the implicit feedback to the matrix factorization. The other is the Vector-based FM Model, which expresses the order information of the watching history. Empirical results on three datasets demonstrate that our method performs better than the baseline model and confirm that the Vector-based FM Model usually works better because it contains the order information.
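For readers unfamiliar with Factorization Machines, the sketch below evaluates the standard second-order FM prediction, using the usual identity that reduces the pairwise term to linear time. How the topic-based or vector-based features are built is not described in the abstract, so the feature vector `x` here is simply assumed to concatenate user, item, and history-derived features; the function name `fm_predict` is hypothetical.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine prediction for one feature vector.

    x  : (n,) feature vector (assumed layout: user, item, extra features)
    w0 : global bias
    w  : (n,) linear weights
    V  : (n, k) latent factors, one k-dimensional row per feature
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    V = np.asarray(V, dtype=float)
    linear = w0 + w @ x
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = V.T @ x
    pairwise = 0.5 * (np.sum(xv ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise
```

Appending extra columns to `x` (for example, a topic distribution or an embedding of the watching history) leaves the formula unchanged, which is what makes FM a convenient host for such side information.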