Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Learning by Examples Based on Multi-level Optimization

Sep 22, 2021
Shentong Mo, Pengtao Xie

Learning by examples, which learns to solve a new problem by looking into how similar problems are solved, is an effective learning method in human learning. When a student learns a new topic, he/she finds out exemplar topics that are similar to this new topic and studies the exemplar topics to deepen the understanding of the new topic. We aim to investigate whether this powerful learning skill can be borrowed from humans to improve machine learning as well. In this work, we propose a novel learning approach called Learning By Examples (LBE). Our approach automatically retrieves a set of training examples that are similar to query examples and predicts labels for query examples by using class labels of the retrieved examples. We propose a three-level optimization framework to formulate LBE which involves three stages of learning: learning a Siamese network to retrieve similar examples; learning a matching network to make predictions on query examples by leveraging class labels of retrieved similar examples; learning the ``ground-truth'' similarities between training examples by minimizing the validation loss. We develop an efficient algorithm to solve the LBE problem and conduct extensive experiments on various benchmarks where the results demonstrate the effectiveness of our method on both supervised and few-shot learning.

  Access Paper or Ask Questions

HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

Oct 16, 2019
Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, Jiawei Han

GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as \textit{keyword-driven hierarchical classification}. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the \textsc{HiGitClass} framework, comprising of three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that \textsc{HiGitClass} is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.

* 10 pages; Accepted to ICDM 2019 

  Access Paper or Ask Questions

CODEC: Complex Document and Entity Collection

May 09, 2022
Iain Mackie, Paul Owoicho, Carlos Gemmell, Sophie Fischer, Sean MacAvaney, Jeffrey Dalton

CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, i.e. "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation. CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows significant gains on document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve the performance of document and entity ranking. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.

* 10 pages, SIGIR 2022 Preprint 

  Access Paper or Ask Questions

Knowledge forest: a novel model to organize knowledge fragments

Dec 14, 2019
Qinghua Zheng, Jun Liu, Hongwei Zeng, Zhaotong Guo, Bei Wu, Bifan Wei

With the rapid growth of knowledge, it shows a steady trend of knowledge fragmentization. Knowledge fragmentization manifests as that the knowledge related to a specific topic in a course is scattered in isolated and autonomous knowledge sources. We term the knowledge of a facet in a specific topic as a knowledge fragment. The problem of knowledge fragmentization brings two challenges: First, knowledge is scattered in various knowledge sources, which exerts users' considerable efforts to search for the knowledge of their interested topics, thereby leading to information overload. Second, learning dependencies which refer to the precedence relationships between topics in the learning process are concealed by the isolation and autonomy of knowledge sources, thus causing learning disorientation. To solve the knowledge fragmentization problem, we propose a novel knowledge organization model, knowledge forest, which consists of facet trees and learning dependencies. Facet trees can organize knowledge fragments with facet hyponymy to alleviate information overload. Learning dependencies can organize disordered topics to cope with learning disorientation. We conduct extensive experiments on three manually constructed datasets from the Data Structure, Data Mining, and Computer Network courses, and the experimental results show that knowledge forest can effectively organize knowledge fragments, and alleviate information overload and learning disorientation.

* 3 pages, 1 figures, Accepted for publication in Science China Information Science 

  Access Paper or Ask Questions

GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

Oct 08, 2020
Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, Xiaodan Liang

Automatically evaluating dialogue coherence is a challenging but high-demand ability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly considering the fine-grained topic transition dynamics of dialogue flows. Here, we first consider that the graph structure constituted with topics in a dialogue can accurately depict the underlying communication logic, which is a more natural way to produce persuasive metrics. Capitalized on the topic-level dialogue graph, we propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence. The graph representations are obtained by reasoning over topic-level dialogue graphs enhanced with the evidence from a commonsense graph, including k-hop neighboring representations and hop-attention weights. Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models in terms of the Pearson and Spearman correlations with human judgements. Besides, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics.

* Long paper; EMNLP2020 

  Access Paper or Ask Questions

TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency

Feb 27, 2017
Adji B. Dieng, Chong Wang, Jianfeng Gao, John Paisley

In this paper, we propose TopicRNN, a recurrent neural network (RNN)-based language model designed to directly capture the global semantic meaning relating words in a document via latent topics. Because of their sequential nature, RNNs are good at capturing the local structure of a word sequence - both semantic and syntactic - but might face difficulty remembering long-range dependencies. Intuitively, these long-range dependencies are of semantic nature. In contrast, latent topic models are able to capture the global underlying semantic structure of a document but do not account for word ordering. The proposed TopicRNN model integrates the merits of RNNs and latent topic models: it captures local (syntactic) dependencies using an RNN and global (semantic) dependencies using latent topics. Unlike previous work on contextual RNN language modeling, our model is learned end-to-end. Empirical results on word prediction show that TopicRNN outperforms existing contextual RNN baselines. In addition, TopicRNN can be used as an unsupervised feature extractor for documents. We do this for sentiment analysis on the IMDB movie review dataset and report an error rate of $6.28\%$. This is comparable to the state-of-the-art $5.91\%$ resulting from a semi-supervised approach. Finally, TopicRNN also yields sensible topics, making it a useful alternative to document models such as latent Dirichlet allocation.

* International Conference on Learning Representations 

  Access Paper or Ask Questions

Learning Mixtures of Arbitrary Distributions over Large Discrete Domains

Sep 18, 2013
Yuval Rabani, Leonard Schulman, Chaitanya Swamy

We give an algorithm for learning a mixture of {\em unstructured} distributions. This problem arises in various unsupervised learning scenarios, for example in learning {\em topic models} from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of $k$ arbitrary distributions over a large discrete domain $[n]=\{1,2,\dots,n\}$ and the mixture weights, using $O(n\polylog n)$ samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for $k>1$ under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the {\em "sampling aperture"}, is a crucial parameter of the problem. We obtain the {\em first} bounds for this mixture-learning problem {\em without imposing any assumptions on the mixture constituents.} We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of $2k-1$. Thus, we achieve near-optimal dependence on $n$ and optimal aperture. While the sample-size required by our algorithm depends exponentially on $k$, we prove that such a dependence is {\em unavoidable} when one considers general mixtures. A sequence of tools contribute to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimations, and sensitivity analysis.

* Update of previous version with improved aperture and sample-size lower bounds 

  Access Paper or Ask Questions

Modeling Social Annotation: a Bayesian Approach

May 26, 2010
Anon Plangprasopchok, Kristina Lerman

Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, e.g., Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users, can potentially be used to infer categorical knowledge, classify documents or recommend new relevant information. Traditional text inference methods do not make best use of social annotation, since they do not take into account variations in individual users' perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on the synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.

* 29 Pages, Accepted for publication at ACM Transactions on Knowledge Discovery from Data(TKDD) on March 2, 2010 

  Access Paper or Ask Questions

Factorized Topic Models

Apr 23, 2013
Cheng Zhang, Carl Henrik Ek, Andreas Damianou, Hedvig Kjellstrom

In this paper we present a modification to a latent topic model, which makes the model exploit supervision to produce a factorized representation of the observed data. The structured parameterization separately encodes variance that is shared between classes from variance that is private to each class by the introduction of a new prior over the topic space. The approach allows for a more eff{}icient inference and provides an intuitive interpretation of the data in terms of an informative signal together with structured noise. The factorized representation is shown to enhance inference performance for image, text, and video classification.

* ICLR 2013 

  Access Paper or Ask Questions

Extracting and Analyzing Semantic Relatedness between Cities Using News Articles

Sep 08, 2018
Yingjie Hu, Xinyue Ye, Shih-Lung Shaw

News articles capture a variety of topics about our society. They reflect not only the socioeconomic activities that happened in our physical world, but also some of the cultures, human interests, and public concerns that exist only in the perceptions of people. Cities are frequently mentioned in news articles, and two or more cities may co-occur in the same article. Such co-occurrence often suggests certain relatedness between the mentioned cities, and the relatedness may be under different topics depending on the contents of the news articles. We consider the relatedness under different topics as semantic relatedness. By reading news articles, one can grasp the general semantic relatedness between cities, yet, given hundreds of thousands of news articles, it is very difficult, if not impossible, for anyone to manually read them. This paper proposes a computational framework which can "read" a large number of news articles and extract the semantic relatedness between cities. This framework is based on a natural language processing model and employs a machine learning process to identify the main topics of news articles. We describe the overall structure of this framework and its individual modules, and then apply it to an experimental dataset with more than 500,000 news articles covering the top 100 U.S. cities spanning a 10-year period. We perform exploratory visualization of the extracted semantic relatedness under different topics and over multiple years. We also analyze the impact of geographic distance on semantic relatedness and find varied distance decay effects. The proposed framework can be used to support large-scale content analysis in city network research.

* International Journal of Geographical Information Science, 2017 

  Access Paper or Ask Questions