Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora

Apr 13, 2020
Federico Bianchi, Valerio Di Carlo, Paolo Nicoli, Matteo Palmonari

Word2vec is one of the most used algorithms to generate word embeddings because of a good mix of efficiency, quality of the generated representations and cognitive grounding. However, word meaning is not static and depends on the context in which words are used. Differences in word meaning that depends on time, location, topic, and other factors, can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, where embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence about the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, yet providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable, which substantially depends on the degree of cross-corpora vocabulary overlap.

* arXiv admin note: text overlap with arXiv:1906.02376 

  Access Paper or Ask Questions

Idea density for predicting Alzheimer's disease from transcribed speech

Jun 14, 2017
Kairit Sirts, Olivier Piguet, Mark Johnson

Idea Density (ID) measures the rate at which ideas or elementary predications are expressed in an utterance or in a text. Lower ID is found to be associated with an increased risk of developing Alzheimer's disease (AD) (Snowdon et al., 1996; Engelman et al., 2010). ID has been used in two different versions: propositional idea density (PID) counts the expressed ideas and can be applied to any text while semantic idea density (SID) counts pre-defined information content units and is naturally more applicable to normative domains, such as picture description tasks. In this paper, we develop DEPID, a novel dependency-based method for computing PID, and its version DEPID-R that enables to exclude repeating ideas---a feature characteristic to AD speech. We conduct the first comparison of automatically extracted PID and SID in the diagnostic classification task on two different AD datasets covering both closed-topic and free-recall domains. While SID performs better on the normative dataset, adding PID leads to a small but significant improvement (+1.7 F-score). On the free-topic dataset, PID performs better than SID as expected (77.6 vs 72.3 in F-score) but adding the features derived from the word embedding clustering underlying the automatic SID increases the results considerably, leading to an F-score of 84.8.

* CoNLL 2017 

  Access Paper or Ask Questions

Markov Determinantal Point Processes

Oct 16, 2012
Raja Hafiz Affandi, Alex Kulesza, Emily B. Fox

A determinantal point process (DPP) is a random process useful for modeling the combinatorial problem of subset selection. In particular, DPPs encourage a random subset Y to contain a diverse set of items selected from a base set Y. For example, we might use a DPP to display a set of news headlines that are relevant to a user's interests while covering a variety of topics. Suppose, however, that we are asked to sequentially select multiple diverse sets of items, for example, displaying new headlines day-by-day. We might want these sets to be diverse not just individually but also through time, offering headlines today that are unlike the ones shown yesterday. In this paper, we construct a Markov DPP (M-DPP) that models a sequence of random sets {Yt}. The proposed M-DPP defines a stationary process that maintains DPP margins. Crucially, the induced union process Zt = Yt u Yt-1 is also marginally DPP-distributed. Jointly, these properties imply that the sequence of random sets are encouraged to be diverse both at a given time step as well as across time steps. We describe an exact, efficient sampling procedure, and a method for incrementally learning a quality measure over items in the base set Y based on external preferences. We apply the M-DPP to the task of sequentially displaying diverse and relevant news articles to a user with topic preferences.

* Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012) 

  Access Paper or Ask Questions

Understanding the role of single-board computers in engineering and computer science education: A systematic literature review

Mar 30, 2022
Jonathan Álvarez Ariza, Heyson Baez

In the last decade, Single-Board Computers (SBCs) have been employed more frequently in engineering and computer science both to technical and educational levels. Several factors such as the versatility, the low-cost, and the possibility to enhance the learning process through technology have contributed to the educators and students usually employ these devices. However, the implications, possibilities, and constraints of these devices in engineering and Computer Science (CS) education have not been explored in detail. In this systematic literature review, we explore how the SBCs are employed in engineering and computer science and what educational results are derived from their usage in the period 2010-2020 at tertiary education. For that, 154 studies were selected out of n=605 collected from the academic databases Ei Compendex, ERIC, and Inspec. The analysis was carried-out in two phases, identifying, e.g., areas of application, learning outcomes, and students and researchers' perceptions. The results mainly indicate the following aspects: (1) The areas of laboratories and e-learning, computing education, robotics, Internet of Things (IoT), and persons with disabilities gather the studies in the review. (2) Researchers highlight the importance of the SBCs to transform the curricula in engineering and CS for the students to learn complex topics through experimentation in hands-on activities. (3) The typical cognitive learning outcomes reported by the authors are the improvement of the students' grades and the technical skills regarding the topics in the courses. Concerning the affective learning outcomes, the increase of interest, motivation, and engagement are commonly reported by the authors.

* Computer applications in engineering education (2022); vol 30; pp 304-329 
* 27 pages 

  Access Paper or Ask Questions

A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing

Jun 11, 2021
Bin Hao, Min Zhang, Weizhi Ma, Shaoyun Shi, Xinxing Yu, Houzhi Shan, Yiqun Liu, Shaoping Ma

Data plays a vital role in machine learning studies. In the research of recommendation, both user behaviors and side information are helpful to model users. So, large-scale real scenario datasets with abundant user behaviors will contribute a lot. However, it is not easy to get such datasets as most of them are only hold and protected by companies. In this paper, a new large-scale dataset collected from a knowledge-sharing platform is presented, which is composed of around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query keywords. There are also descriptions of users, answers, questions, authors, and topics, which are anonymous. Note that each user's latest query keywords have not been included in previous open datasets, which reveal users' explicit information needs. We characterize the dataset and demonstrate its potential applications for recommendation study. Multiple experiments show the dataset can be used to evaluate algorithms in general top-N recommendation, sequential recommendation, and context-aware recommendation. This dataset can also be used to integrate search and recommendation and recommendation with negative feedback. Besides, tasks beyond recommendation, such as user gender prediction, most valuable answerer identification, and high-quality answer recognition, can also use this dataset. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.

* 7 pages 

  Access Paper or Ask Questions

Query-oriented text summarization based on hypergraph transversals

Feb 02, 2019
Hadrien Van Lierde, Tommy W. S. Chow

Existing graph- and hypergraph-based algorithms for document summarization represent the sentences of a corpus as the nodes of a graph or a hypergraph in which the edges represent relationships of lexical similarities between sentences. Each sentence of the corpus is then scored individually, using popular node ranking algorithms, and a summary is produced by extracting highly scored sentences. This approach fails to select a subset of jointly relevant sentences and it may produce redundant summaries that are missing important topics of the corpus. To alleviate this issue, a new hypergraph-based summarizer is proposed in this paper, in which each node is a sentence and each hyperedge is a theme, namely a group of sentences sharing a topic. Themes are weighted in terms of their prominence in the corpus and their relevance to a user-defined query. It is further shown that the problem of identifying a subset of sentences covering the relevant themes of the corpus is equivalent to that of finding a hypergraph transversal in our theme-based hypergraph. Two extensions of the notion of hypergraph transversal are proposed for the purpose of summarization, and polynomial time algorithms building on the theory of submodular functions are proposed for solving the associated discrete optimization problems. The worst-case time complexity of the proposed algorithms is squared in the number of terms, which makes it cheaper than the existing hypergraph-based methods. A thorough comparative analysis with related models on DUC benchmark datasets demonstrates the effectiveness of our approach, which outperforms existing graph- or hypergraph-based methods by at least 6% of ROUGE-SU4 score.

* This is the unrefereed Author's Original Version (or pre-print Version) of the article 

  Access Paper or Ask Questions

CoVeR: Learning Covariate-Specific Vector Representations with Tensor Decompositions

Jul 07, 2018
Kevin Tian, Teng Zhang, James Zou

Word embedding is a useful approach to capture co-occurrence structures in large text corpora. However, in addition to the text data itself, we often have additional covariates associated with individual corpus documents---e.g. the demographic of the author, time and venue of publication---and we would like the embedding to naturally capture this information. We propose CoVeR, a new tensor decomposition model for vector embeddings with covariates. CoVeR jointly learns a \emph{base} embedding for all the words as well as a weighted diagonal matrix to model how each covariate affects the base embedding. To obtain author or venue-specific embedding, for example, we can then simply multiply the base embedding by the associated transformation matrix. The main advantages of our approach are data efficiency and interpretability of the covariate transformation. Our experiments demonstrate that our joint model learns substantially better covariate-specific embeddings compared to the standard approach of learning a separate embedding for each covariate using only the relevant subset of data, as well as other related methods. Furthermore, CoVeR encourages the embeddings to be "topic-aligned" in that the dimensions have specific independent meanings. This allows our covariate-specific embeddings to be compared by topic, enabling downstream differential analysis. We empirically evaluate the benefits of our algorithm on datasets, and demonstrate how it can be used to address many natural questions about covariate effects. Accompanying code to this paper can be found at

* 12 pages. Appears in ICML 2018 

  Access Paper or Ask Questions

New Hybrid Neuro-Evolutionary Algorithms for Renewable Energy and Facilities Management Problems

Jun 05, 2018
L. Cornejo-Bueno

This Ph.D. thesis deals with the optimization of several renewable energy resources development as well as the improvement of facilities management in oceanic engineering and airports, using computational hybrid methods belonging to AI to this end. Energy is essential to our society in order to ensure a good quality of life. This means that predictions over the characteristics on which renewable energies depend are necessary, in order to know the amount of energy that will be obtained at any time. The second topic tackled in this thesis is related to the basic parameters that influence in different marine activities and airports, whose knowledge is necessary to develop a proper facilities management in these environments. Within this work, a study of the state-of-the-art Machine Learning have been performed to solve the problems associated with the topics above-mentioned, and several contributions have been proposed: One of the pillars of this work is focused on the estimation of the most important parameters in the exploitation of renewable resources. The second contribution of this thesis is related to feature selection problems. The proposed methodologies are applied to multiple problems: the prediction of $H_s$, relevant for marine energy applications and marine activities, the estimation of WPREs, undesirable variations in the electric power produced by a wind farm, the prediction of global solar radiation in areas from Spain and Australia, really important in terms of solar energy, and the prediction of low-visibility events at airports. All of these practical issues are developed with the consequent previous data analysis, normally, in terms of meteorological variables.

* arXiv admin note: text overlap with arXiv:1706.03673, arXiv:1805.03463 by other authors 

  Access Paper or Ask Questions

Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

Jan 28, 2021
Elena Zotova, Rodrigo Agerri, German Rigau

Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and crosslingual research on stance detection. This is partially due to the fact that manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is specially demanding. As a result, most of the manually labeled resources are hindered by their relatively small size and skewed class distribution. This paper presents a method to obtain multilingual datasets for stance detection in Twitter. Instead of manually annotating on a per tweet basis, we leverage user-based information to semi-automatically label large amounts of tweets. Empirical monolingual and cross-lingual experimentation and qualitative analysis show that our method helps to overcome the aforementioned difficulties to build large, balanced and multilingual labeled corpora. We believe that our method can be easily adapted to easily generate labeled social media data for other Natural Language Processing tasks and domains.

* Expert Systems with Applications, 170 (2021), Elsevier 
* Stance detection, multilingualism, text categorization, fake news, deep learning 

  Access Paper or Ask Questions

ADMM-based Networked Stochastic Variational Inference

Feb 27, 2018
Hamza Anwar, Quanyan Zhu

Owing to the recent advances in "Big Data" modeling and prediction tasks, variational Bayesian estimation has gained popularity due to their ability to provide exact solutions to approximate posteriors. One key technique for approximate inference is stochastic variational inference (SVI). SVI poses variational inference as a stochastic optimization problem and solves it iteratively using noisy gradient estimates. It aims to handle massive data for predictive and classification tasks by applying complex Bayesian models that have observed as well as latent variables. This paper aims to decentralize it allowing parallel computation, secure learning and robustness benefits. We use Alternating Direction Method of Multipliers in a top-down setting to develop a distributed SVI algorithm such that independent learners running inference algorithms only require sharing the estimated model parameters instead of their private datasets. Our work extends the distributed SVI-ADMM algorithm that we first propose, to an ADMM-based networked SVI algorithm in which not only are the learners working distributively but they share information according to rules of a graph by which they form a network. This kind of work lies under the umbrella of `deep learning over networks' and we verify our algorithm for a topic-modeling problem for corpus of Wikipedia articles. We illustrate the results on latent Dirichlet allocation (LDA) topic model in large document classification, compare performance with the centralized algorithm, and use numerical experiments to corroborate the analytical results.

* to be submitted for publishing 

  Access Paper or Ask Questions