"Topic": models, code, and papers

Recommending Researchers in Machine Learning based on Author-Topic Model

Sep 05, 2021
Deepak Sharma, Bijendra Kumar, Satish Chand

This paper aims to identify leading researchers in machine learning using the author-topic model (ATM). We collect 16,855 scientific papers from six top journals in the field of machine learning published from 1997 to 2016 and analyze them using ATM. The dataset is broken down into 4 intervals to identify the top researchers and find similar researchers using their similarity score. The similarity score is calculated using Hellinger distance. The researchers are plotted using t-SNE, which reduces the dimensionality of the data while keeping the same distance between the points. Our analysis helps new researchers find the top researchers in their area of interest.

* 12 pages, 6 figures, presented in conference 
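
The Hellinger-based comparison of authors mentioned in the abstract can be illustrated with a minimal sketch. The author names and topic distributions below are hypothetical, and ranking by smallest distance is an assumption about how the similarity score is used, not the paper's exact procedure.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

def most_similar(name, dists):
    """Return the other author whose topic distribution is closest to `name`."""
    others = (n for n in dists if n != name)
    return min(others, key=lambda n: hellinger(dists[name], dists[n]))

# Hypothetical author-topic distributions (each row sums to 1).
authors = {
    "author_a": [0.70, 0.20, 0.10],
    "author_b": [0.65, 0.25, 0.10],
    "author_c": [0.05, 0.15, 0.80],
}

closest = most_similar("author_a", authors)
```

The distance ranges from 0 (identical distributions) to 1 (disjoint support), so a smaller value means more similar authors.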

Robust Document Representations using Latent Topics and Metadata

Oct 23, 2020
Natraj Raman, Armineh Nourbakhsh, Sameena Shah, Manuela Veloso

Task-specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples are not available at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task-agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft-partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The generated document embeddings exhibit compositional characteristics and are directly used by downstream classification tasks to create decision boundaries from a small number of labeled examples, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple datasets.

* 9 pages, 7 figures 
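
As a rough illustration of using a topic distribution as surrogate labels with a KL-divergence loss, here is a minimal sketch. The target distribution, the logits, and the direction KL(target || prediction) are all assumptions made for this example; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values: a topic distribution from a pre-learned topic model,
# and the raw outputs of a text encoder for the same document.
topic_target = [0.6, 0.3, 0.1]
encoder_logits = [2.0, 1.0, 0.0]

loss = kl_divergence(topic_target, softmax(encoder_logits))
```

The loss is zero only when the predicted distribution matches the topic distribution, so minimizing it pushes the encoder toward the soft partition induced by the topic model.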

The Sparse Hausdorff Moment Problem, with Application to Topic Models

Jul 22, 2020
Spencer Gordon, Bijan Mazaheri, Leonard J. Schulman, Yuval Rabani

We consider the problem of identifying, from its first $m$ noisy moments, a probability distribution on $[0,1]$ of support $k<\infty$. This is equivalent to the problem of learning a distribution on $m$ observable binary random variables $X_1,X_2,\dots,X_m$ that are iid conditional on a hidden random variable $U$ taking values in $\{1,2,\dots,k\}$. Our focus is on accomplishing this with $m=2k$, which is the minimum $m$ for which verifying that the source is a $k$-mixture is possible (even with exact statistics). This problem, so simply stated, is quite useful: e.g., by a known reduction, any algorithm for it lifts to an algorithm for learning pure topic models. In past work on this and also the more general mixture-of-products problem ($X_i$ independent conditional on $U$, but not necessarily iid), a barrier at $m^{O(k^2)}$ on the sample complexity and/or runtime of the algorithm was reached. We improve this substantially. We show it suffices to use a sample of size $\exp(k\log k)$ (with $m=2k$). It is known that the sample complexity of any solution to the identification problem must be $\exp(\Omega(k))$. Stated in terms of the moment problem, it suffices to know the moments to additive accuracy $\exp(-k\log k)$. Our run-time for the moment problem is only $O(k^{2+o(1)})$ arithmetic operations.
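
The equivalence between moments and conditionally iid binary variables can be checked on a small example. The sketch below uses a hypothetical 2-point distribution (weights and atoms chosen only for illustration) and verifies by exhaustive enumeration that the probability of the first $j$ variables all being 1 equals the $j$-th moment. It does not implement the paper's identification algorithm.

```python
from fractions import Fraction
from itertools import product

def moments(weights, atoms, m):
    """First m moments of the discrete distribution sum_i w_i * delta(x_i)."""
    return [sum(w * x**j for w, x in zip(weights, atoms)) for j in range(1, m + 1)]

def prob_all_ones(weights, atoms, j, m):
    """P(X_1 = ... = X_j = 1), summing the joint pmf over all outcomes of X_1..X_m."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=m):
        if not all(bits[:j]):
            continue
        for w, x in zip(weights, atoms):
            p = w  # probability that the hidden variable U selects this atom
            for b in bits:
                p *= x if b else (1 - x)  # X_i are iid Bernoulli(x) given U
            total += p
    return total

# Hypothetical k = 2 mixture on [0, 1], observed through m = 2k = 4 variables.
weights = [Fraction(1, 3), Fraction(2, 3)]
atoms = [Fraction(1, 4), Fraction(3, 4)]
m = 4

mu = moments(weights, atoms, m)
```

Exact rational arithmetic is used so the two quantities can be compared for equality; the paper's setting instead concerns noisy moments.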

Viola: A Topic Agnostic Generate-and-Rank Dialogue System

Aug 25, 2021
Hyundong Cho, Basel Shbita, Kartik Shenoy, Shuai Liu, Nikhil Patel, Hitesh Pindikanti, Jennifer Lee, Jonathan May

We present Viola, an open-domain dialogue system for spoken conversation that uses a topic-agnostic dialogue manager based on a simple generate-and-rank approach. Leveraging recent advances of generative dialogue systems powered by large language models, Viola fetches a batch of response candidates from various neural dialogue models trained with different datasets and knowledge-grounding inputs. Additional responses originating from template-based generators are also considered, depending on the user's input and detected entities. The hand-crafted generators build on a dynamic knowledge graph injected with rich content that is crawled from the web and automatically processed on a daily basis. Viola's response ranker is a fine-tuned polyencoder that chooses the best response given the dialogue history. While dedicated annotations for the polyencoder alone can indirectly steer it away from choosing problematic responses, we add rule-based safety nets to detect neural degeneration and a dedicated classifier to filter out offensive content. We analyze conversations that Viola took part in for the Alexa Prize Socialbot Grand Challenge 4 and discuss the strengths and weaknesses of our approach. Lastly, we suggest future work with a focus on curating conversation data specifically for socialbots that will contribute towards a more robust data-driven socialbot.

* Alexa Prize Socialbot Grand Challenge 4 Proceedings, 23 pages 
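
The generate-and-rank control flow described in the abstract can be sketched as follows. Everything here is a hypothetical stand-in: the generators are plain functions, the safety check is a keyword test, and the scorer is a word-overlap heuristic used in place of the fine-tuned polyencoder, which this sketch does not reproduce.

```python
def overlap_score(history, candidate):
    """Stand-in scorer (assumption): fraction of candidate words found in the history."""
    seen = set(history.lower().split())
    words = candidate.lower().split()
    if not words:
        return 0.0
    return sum(w in seen for w in words) / len(words)

def generate_and_rank(history, generators, is_safe, score=overlap_score):
    """Collect one candidate per generator, drop unsafe ones, return the best-scoring."""
    candidates = [generate(history) for generate in generators]
    safe = [c for c in candidates if is_safe(c)]
    if not safe:
        return None  # the caller would fall back to a default response
    return max(safe, key=lambda c: score(history, c))

# Hypothetical generators and safety filter, for illustration only.
generators = [
    lambda h: "jazz music is great",
    lambda h: "tell me about your day",
    lambda h: "i love jazz badword",
]
is_safe = lambda c: "badword" not in c.lower()

reply = generate_and_rank("i love jazz music", generators, is_safe)
```

Filtering before ranking means an unsafe candidate can never win, even if the ranker scores it highest.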

Common Topics and Coherent Situations: Interpreting Ellipsis in the Context of Discourse Inference

May 03, 1994
Andrew Kehler

It is claimed that a variety of facts concerning ellipsis, event reference, and interclausal coherence can be explained by two features of the linguistic form in question: (1) whether the form leaves behind an empty constituent in the syntax, and (2) whether the form is anaphoric in the semantics. It is proposed that these features interact with one of two types of discourse inference, namely Common Topic inference and Coherent Situation inference. The differing ways in which these types of inference utilize syntactic and semantic representations predict phenomena for which it is otherwise difficult to account.

* ACL-94, Las Cruces, New Mexico 
* To be presented at ACL-94. 13 pages, LaTeX source, accompanying PostScript figures, requires psfig and lingmacros. Comments are welcome 

Topic Detection and Tracking with Time-Aware Document Embeddings

Dec 12, 2021
Hang Jiang, Doug Beeferman, Weiquan Mao, Deb Roy

The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not adequately capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
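
The triplet loss used for fine-tuning can be written down in a few lines. This is a generic sketch with Euclidean distance and a margin of 1.0, both assumed here; the embeddings are hypothetical, and the paper's fusion of time and text is not modeled.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical document embeddings: the anchor and positive describe the same
# event, while the negative describes a different one.
anchor = [0.0, 0.0]
same_event = [0.0, 0.0]
other_event = [3.0, 4.0]

satisfied = triplet_loss(anchor, same_event, other_event)
violated = triplet_loss(anchor, other_event, same_event)
```

The loss is zero once the same-event document is closer to the anchor than the other-event document by at least the margin, and grows linearly when that ordering is violated.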

Topic Modeling the Reading and Writing Behavior of Information Foragers

Jun 30, 2019
Jaimie Murdock

The general problem of "information foraging" in an environment about which agents have incomplete information has been explored in many fields, including cognitive psychology, neuroscience, economics, finance, ecology, and computer science. In all of these areas, the searcher aims to enhance future performance by surveying enough of existing knowledge to orient themselves in the information space. Individuals can be viewed as conducting a cognitive search in which they must balance exploration of ideas that are novel to them against exploitation of knowledge in domains in which they are already expert. In this dissertation, I present several case studies that demonstrate how reading and writing behaviors interact to construct personal knowledge bases. These studies use LDA topic modeling to represent the information environment of the texts each author read and wrote. Three studies revolve around Charles Darwin. Darwin left detailed records of every book he read for 23 years, from disembarking from the H.M.S. Beagle to just after publication of The Origin of Species. Additionally, he left copies of his drafts before publication. I characterize his reading behavior, then show how that reading behavior interacted with the drafts and subsequent revisions of The Origin of Species, and expand the dataset to include later readings and writings. Then, through a study of Thomas Jefferson's correspondence, I expand the study to non-book data. Finally, through an examination of neuroscience citation data, I move from individual behavior to collective behavior in constructing an information environment. Together, these studies reveal "the interplay between individual and collective phenomena where innovation takes place" (Tria et al. 2014).

* Accepted Ph.D. dissertation, Indiana University, Informatics (Complex Systems) and Cognitive Science, June 2019 

Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Oct 09, 2013
Jun Zhu, Xun Zheng, Bo Zhang

Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. We address these issues by: 1) introducing a regularization constant to better balance the two parts based on an optimization formulation of Bayesian inference; and 2) developing a simple Gibbs sampling algorithm by introducing auxiliary Polya-Gamma variables and collapsing out Dirichlet variables. Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. Empirical results demonstrate significant improvements on prediction performance and time efficiency.

* 9 pages, ACL 2013 

Zero-Shot Object Recognition System based on Topic Model

Oct 14, 2014
Wai Lam Hoo, Chee Seng Chan

Object recognition systems usually require complete, manually labeled training data to train the classifier. In this paper, we study the problem of object recognition where the training samples are missing during the classifier learning stage, a task also known as zero-shot learning. We propose a novel zero-shot learning strategy that utilizes the topic model and hierarchical class concept. Our proposed method has the advantage that the cumbersome human annotation stage (i.e., attribute-based classification) is eliminated. We achieve comparable performance with state-of-the-art algorithms on four public datasets: PubFig (67.09%), Cifar-100 (54.85%), Caltech-256 (52.14%), and Animals with Attributes (49.65%) when unseen classes exist in the classification task.

* To appear in IEEE Transactions on Human-Machine Systems 