Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Recommendation": models, code, and papers

Recovering Structured Probability Matrices

Feb 06, 2018
Qingqing Huang, Sham M. Kakade, Weihao Kong, Gregory Valiant

We consider the problem of accurately recovering a matrix B of size M by M , which represents a probability distribution over M2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient algorithm that accurately recovers the underlying M by M matrix using Theta(M) samples. This result easily translates to Theta(M) sample algorithms for learning topic models and learning hidden Markov Models. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. We provide an even stronger lower bound where distinguishing whether a sequence of observations were drawn from the uniform distribution over M observations versus being generated by an HMM with two hidden states requires Omega(M) observations. This precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear sample estimators for quantities such as the entropy rate of HMMs.

  Access Paper or Ask Questions

Query-driven Procedures for Hybrid MKNF Knowledge Bases

Dec 09, 2011
José Júlio Alferes, Matthias Knorr, Terrance Swift

Hybrid MKNF knowledge bases are one of the most prominent tightly integrated combinations of open-world ontology languages with closed-world (non-monotonic) rule paradigms. The definition of Hybrid MKNF is parametric on the description logic (DL) underlying the ontology language, in the sense that non-monotonic rules can extend any decidable DL language. Two related semantics have been defined for Hybrid MKNF: one that is based on the Stable Model Semantics for logic programs and one on the Well-Founded Semantics (WFS). Under WFS, the definition of Hybrid MKNF relies on a bottom-up computation that has polynomial data complexity whenever the DL language is tractable. Here we define a general query-driven procedure for Hybrid MKNF that is sound with respect to the stable model-based semantics, and sound and complete with respect to its WFS variant. This procedure is able to answer a slightly restricted form of conjunctive queries, and is based on tabled rule evaluation extended with an external oracle that captures reasoning within the ontology. Such an (abstract) oracle receives as input a query along with knowledge already derived, and replies with a (possibly empty) set of atoms, defined in the rules, whose truth would suffice to prove the initial query. With appropriate assumptions on the complexity of the abstract oracle, the general procedure maintains the data complexity of the WFS for Hybrid MKNF knowledge bases. To illustrate this approach, we provide a concrete oracle for EL+, a fragment of the light-weight DL EL++. Such an oracle has practical use, as EL++ is the language underlying OWL 2 EL, which is part of the W3C recommendations for the Semantic Web, and is tractable for reasoning tasks such as subsumption. We show that query-driven Hybrid MKNF preserves polynomial data complexity when using the EL+ oracle and WFS.

* 48 pages with 1 figures, submitted to ACM TOCL 

  Access Paper or Ask Questions

Learning What You Need from What You Did: Product Taxonomy Expansion with User Behaviors Supervision

Mar 28, 2022
Sijie Cheng, Zhouhong Gu, Bang Liu, Rui Xie, Wei Wu, Yanghua Xiao

Taxonomies have been widely used in various domains to underpin numerous applications. Specially, product taxonomies serve an essential role in the e-commerce domain for the recommendation, browsing, and query understanding. However, taxonomies need to constantly capture the newly emerged terms or concepts in e-commerce platforms to keep up-to-date, which is expensive and labor-intensive if it relies on manual maintenance and updates. Therefore, we target the taxonomy expansion task to attach new concepts to existing taxonomies automatically. In this paper, we present a self-supervised and user behavior-oriented product taxonomy expansion framework to append new concepts into existing taxonomies. Our framework extracts hyponymy relations that conform to users' intentions and cognition. Specifically, i) to fully exploit user behavioral information, we extract candidate hyponymy relations that match user interests from query-click concepts; ii) to enhance the semantic information of new concepts and better detect hyponymy relations, we model concepts and relations through both user-generated content and structural information in existing taxonomies and user click logs, by leveraging Pre-trained Language Models and Graph Neural Network combined with Contrastive Learning; iii) to reduce the cost of dataset construction and overcome data skews, we construct a high-quality and balanced training dataset from existing taxonomy with no supervision. Extensive experiments on real-world product taxonomies in Meituan Platform, a leading Chinese vertical e-commerce platform to order take-out with more than 70 million daily active users, demonstrate the superiority of our proposed framework over state-of-the-art methods. Notably, our method enlarges the size of real-world product taxonomies from 39,263 to 94,698 relations with 88% precision.

* Accepted by ICDE'22 

  Access Paper or Ask Questions

Clinical Evidence Engine: Proof-of-Concept For A Clinical-Domain-Agnostic Decision Support Infrastructure

Oct 31, 2021
Bojian Hou, Hao Zhang, Gur Ladizhinsky, Gur Ladizhinsky, Stephen Yang, Volodymyr Kuleshov, Fei Wang, Qian Yang

Abstruse learning algorithms and complex datasets increasingly characterize modern clinical decision support systems (CDSS). As a result, clinicians cannot easily or rapidly scrutinize the CDSS recommendation when facing a difficult diagnosis or treatment decision in practice. Over-trust or under-trust are frequent. Prior research has explored supporting such assessments by explaining DST data inputs and algorithmic mechanisms. This paper explores a different approach: Providing precisely relevant, scientific evidence from biomedical literature. We present a proof-of-concept system, Clinical Evidence Engine, to demonstrate the technical and design feasibility of this approach across three domains (cardiovascular diseases, autism, cancer). Leveraging Clinical BioBERT, the system can effectively identify clinical trial reports based on lengthy clinical questions (e.g., "risks of catheter infection among adult patients in intensive care unit who require arterial catheters, if treated with povidone iodine-alcohol"). This capability enables the system to identify clinical trials relevant to diagnostic/treatment hypotheses -- a clinician's or a CDSS's. Further, Clinical Evidence Engine can identify key parts of a clinical trial abstract, including patient population (e.g., adult patients in intensive care unit who require arterial catheters), intervention (povidone iodine-alcohol), and outcome (risks of catheter infection). This capability opens up the possibility of enabling clinicians to 1) rapidly determine the match between a clinical trial and a clinical question, and 2) understand the result and contexts of the trial without extensive reading. We demonstrate this potential by illustrating two example use scenarios of the system. We discuss the idea of designing DST explanations not as specific to a DST or an algorithm, but as a domain-agnostic decision support infrastructure.

  Access Paper or Ask Questions

Data Troubles in Sentence Level Confidence Estimation for Machine Translation

Oct 26, 2020
Ciprian Chelba, Junpei Zhou, Yuezhang, Li, Hideto Kazawa, Jeff Klingner, Mengmeng Niu

The paper investigates the feasibility of confidence estimation for neural machine translation models operating at the high end of the performance spectrum. As a side product of the data annotation process necessary for building such models we propose sentence level accuracy $SACC$ as a simple, self-explanatory evaluation metric for quality of translation. Experiments on two different annotator pools, one comprised of non-expert (crowd-sourced) and one of expert (professional) translators show that $SACC$ can vary greatly depending on the translation proficiency of the annotators, despite the fact that both pools are about equally reliable according to Krippendorff's alpha metric; the relatively low values of inter-annotator agreement confirm the expectation that sentence-level binary labeling $good$ / $needs\ work$ for translation out of context is very hard. For an English-Spanish translation model operating at $SACC = 0.89$ according to a non-expert annotator pool we can derive a confidence estimate that labels 0.5-0.6 of the $good$ translations in an "in-domain" test set with 0.95 Precision. Switching to an expert annotator pool decreases $SACC$ dramatically: $0.61$ for English-Spanish, measured on the exact same data as above. This forces us to lower the CE model operating point to 0.9 Precision while labeling correctly about 0.20-0.25 of the $good$ translations in the data. We find surprising the extent to which CE depends on the level of proficiency of the annotator pool used for labeling the data. This leads to an important recommendation we wish to make when tackling CE modeling in practice: it is critical to match the end-user expectation for translation quality in the desired domain with the demands of annotators assigning binary quality labels to CE training data.

  Access Paper or Ask Questions

COVI White Paper

May 18, 2020
Hannah Alsdurf, Yoshua Bengio, Tristan Deleu, Prateek Gupta, Daphne Ippolito, Richard Janda, Max Jarvie, Tyler Kolody, Sekoul Krastev, Tegan Maharaj, Robert Obryk, Dan Pilat, Valerie Pisano, Benjamin Prud'homme, Meng Qu, Nasim Rahaman, Irina Rish, Jean-Franois Rousseau, Abhinav Sharma, Brooke Struck, Jian Tang, Martin Weiss, Yun William Yu

The SARS-CoV-2 (Covid-19) pandemic has caused significant strain on public health institutions around the world. Contact tracing is an essential tool to change the course of the Covid-19 pandemic. Manual contact tracing of Covid-19 cases has significant challenges that limit the ability of public health authorities to minimize community infections. Personalized peer-to-peer contact tracing through the use of mobile apps has the potential to shift the paradigm. Some countries have deployed centralized tracking systems, but more privacy-protecting decentralized systems offer much of the same benefit without concentrating data in the hands of a state authority or for-profit corporations. Machine learning methods can circumvent some of the limitations of standard digital tracing by incorporating many clues and their uncertainty into a more graded and precise estimation of infection risk. The estimated risk can provide early risk awareness, personalized recommendations and relevant information to the user. Finally, non-identifying risk data can inform epidemiological models trained jointly with the machine learning predictor. These models can provide statistical evidence for the importance of factors involved in disease transmission. They can also be used to monitor, evaluate and optimize health policy and (de)confinement scenarios according to medical and economic productivity indicators. However, such a strategy based on mobile apps and machine learning should proactively mitigate potential ethical and privacy risks, which could have substantial impacts on society (not only impacts on health but also impacts such as stigmatization and abuse of personal data). Here, we present an overview of the rationale, design, ethical considerations and privacy strategy of `COVI,' a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.

* 63 pages, 1 figure 

  Access Paper or Ask Questions

Dynamic Ranking with the BTL Model: A Nearest Neighbor based Rank Centrality Method

Sep 28, 2021
Eglantine Karlé, Hemant Tyagi

Many applications such as recommendation systems or sports tournaments involve pairwise comparisons within a collection of $n$ items, the goal being to aggregate the binary outcomes of the comparisons in order to recover the latent strength and/or global ranking of the items. In recent years, this problem has received significant interest from a theoretical perspective with a number of methods being proposed, along with associated statistical guarantees under the assumption of a suitable generative model. While these results typically collect the pairwise comparisons as one comparison graph $G$, however in many applications - such as the outcomes of soccer matches during a tournament - the nature of pairwise outcomes can evolve with time. Theoretical results for such a dynamic setting are relatively limited compared to the aforementioned static setting. We study in this paper an extension of the classic BTL (Bradley-Terry-Luce) model for the static setting to our dynamic setup under the assumption that the probabilities of the pairwise outcomes evolve smoothly over the time domain $[0,1]$. Given a sequence of comparison graphs $(G_{t'})_{t' \in \mathcal{T}}$ on a regular grid $\mathcal{T} \subset [0,1]$, we aim at recovering the latent strengths of the items $w_t \in \mathbb{R}^n$ at any time $t \in [0,1]$. To this end, we adapt the Rank Centrality method - a popular spectral approach for ranking in the static case - by locally averaging the available data on a suitable neighborhood of $t$. When $(G_{t'})_{t' \in \mathcal{T}}$ is a sequence of Erd\"os-Renyi graphs, we provide non-asymptotic $\ell_2$ and $\ell_{\infty}$ error bounds for estimating $w_t^*$ which in particular establishes the consistency of this method in terms of $n$, and the grid size $\lvert\mathcal{T}\rvert$. We also complement our theoretical analysis with experiments on real and synthetic data.

* 44 pages, 3 figures 

  Access Paper or Ask Questions

Declarative Machine Learning Systems

Jul 16, 2021
Piero Molino, Christopher Ré

In the last years machine learning (ML) has moved from a academic endeavor to a pervasive technology adopted in almost every aspect of computing. ML-powered products are now embedded in our digital lives: from recommendations of what to watch, to divining our search intent, to powering virtual assistants in consumer and enterprise settings. Recent successes in applying ML in natural sciences revealed that ML can be used to tackle some of the hardest real-world problems humanity faces today. For these reasons ML has become central in the strategy of tech companies and has gathered even more attention from academia than ever before. Despite these successes, what we have witnessed so far is just the beginning. Right now the people training and using ML models are expert developers working within large organizations, but we believe the next wave of ML systems will allow a larger amount of people, potentially without coding skills, to perform the same tasks. These new ML systems will not require users to fully understand all the details of how models are trained and utilized for obtaining predictions. Declarative interfaces are well suited for this goal, by hiding complexity and favouring separation of interests, and can lead to increased productivity. We worked on such abstract interfaces by developing two declarative ML systems, Overton and Ludwig, that require users to declare only their data schema (names and types of inputs) and tasks rather then writing low level ML code. In this article we will describe how ML systems are currently structured, highlight important factors for their success and adoption, what are the issues current ML systems are facing and how the systems we developed addressed them. Finally we will talk about learnings from the development of ML systems throughout the years and how we believe the next generation of ML systems will look like.

  Access Paper or Ask Questions

Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series

Feb 15, 2021
Alasdair Tran, Alexander Mathews, Cheng Soon Ong, Lexing Xie

We propose a new model for networks of time series that influence each other. Graph structures among time series are found in diverse domains, such as web traffic influenced by hyperlinks, product sales influenced by recommendation, or urban transport volume influenced by road networks and weather. There has been recent progress in graph modeling and in time series forecasting, respectively, but an expressive and scalable approach for a network of series does not yet exist. We introduce Radflow, a novel model that embodies three key ideas: a recurrent neural network to obtain node embeddings that depend on time, the aggregation of the flow of influence from neighboring nodes with multi-head attention, and the multi-layer decomposition of time series. Radflow naturally takes into account dynamic networks where nodes and edges change over time, and it can be used for prediction and data imputation tasks. On real-world datasets ranging from a few hundred to a few hundred thousand nodes, we observe that Radflow variants are the best performing model across a wide range of settings. The recurrent component in Radflow also outperforms N-BEATS, the state-of-the-art time series model. We show that Radflow can learn different trends and seasonal patterns, that it is robust to missing nodes and edges, and that correlated temporal patterns among network neighbors reflect influence strength. We curate WikiTraffic, the largest dynamic network of time series with 366K nodes and 22M time-dependent links spanning five years. This dataset provides an open benchmark for developing models in this area, with applications that include optimizing resources for the web. More broadly, Radflow has the potential to improve forecasts in correlated time series networks such as the stock market, and impute missing measurements in geographically dispersed networks of natural phenomena.

* Proceedings of The Web Conference 2021 (WWW '21) 
* Published in The Web Conference 2021. Code is available at 

  Access Paper or Ask Questions

motif2vec: Motif Aware Node Representation Learning for Heterogeneous Networks

Aug 22, 2019
Manoj Reddy Dareddy, Mahashweta Das, Hao Yang

Recent years have witnessed a surge of interest in machine learning on graphs and networks with applications ranging from vehicular network design to IoT traffic management to social network recommendations. Supervised machine learning tasks in networks such as node classification and link prediction require us to perform feature engineering that is known and agreed to be the key to success in applied machine learning. Research efforts dedicated to representation learning, especially representation learning using deep learning, has shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. However, most of the methods are not adequate to handle heterogeneous information networks which pretty much represents most real-world data today. The methods cannot preserve the structure and semantic of multiple types of nodes and links well enough, capture higher-order heterogeneous connectivity patterns, and ensure coverage of nodes for which representations are generated. We propose a novel efficient algorithm, motif2vec that learns node representations or embeddings for heterogeneous networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to transform the original graph to motif graph(s), conduct biased random walk to efficiently explore higher order neighborhoods, and then employ heterogeneous skip-gram model to generate the embeddings. Unlike previous efforts that uses different graph meta-structures to guide the random walk, we use graph motifs to transform the original network and preserve the heterogeneity. We evaluate the proposed algorithm on multiple real-world networks from diverse domains and against existing state-of-the-art methods on multi-class node classification and link prediction tasks, and demonstrate its consistent superiority over prior work.

  Access Paper or Ask Questions