Abstract: Accurately evaluating new policies (e.g., ad-placement models, ranking functions, recommendation functions) is one of the key prerequisites for improving interactive systems. While the conventional approach to evaluation relies on online A/B tests, recent work has shown that counterfactual estimators can provide an inexpensive and fast alternative, since they can be applied offline using log data that was collected from a different policy fielded in the past. In this paper, we address the question of how to estimate the performance of a new target policy when we have log data from multiple historical policies. This question is of great relevance in practice, since policies get updated frequently in most online systems. We show that naively combining data from multiple logging policies can be highly suboptimal. In particular, we find that the standard Inverse Propensity Score (IPS) estimator suffers especially when the logging and target policies diverge -- to the point where throwing away data improves the variance of the estimator. We therefore propose two alternative estimators, which we characterize theoretically and compare experimentally. We find that the new estimators can provide substantially improved estimation accuracy.
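To make the estimation problem concrete, the following is a minimal sketch (in Python/NumPy) of standard IPS evaluation on logs pooled from two logging policies, together with a variant that uses the data-weighted mixture of the logging policies as the effective propensity. The policies, rewards, and the mixture-propensity variant are illustrative assumptions and are not claimed to be the estimators proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5

# Target policy and two (hypothetical) logging policies over a small action space.
pi_target = np.array([0.05, 0.05, 0.10, 0.30, 0.50])
loggers = [np.array([0.50, 0.30, 0.10, 0.05, 0.05]),   # diverges strongly from the target
           np.array([0.10, 0.10, 0.20, 0.30, 0.30])]   # closer to the target
expected_reward = np.array([0.1, 0.2, 0.3, 0.6, 0.9])
n_per_logger = [10_000, 10_000]

# Collect logged data: (action, logging propensity, observed reward).
samples = []
for pi_log, n in zip(loggers, n_per_logger):
    actions = rng.choice(n_actions, size=n, p=pi_log)
    rewards = rng.binomial(1, expected_reward[actions])
    samples.extend(zip(actions, pi_log[actions], rewards))

def ips_naive(samples):
    """Standard IPS on the pooled log: weight each sample by pi_target / pi_logger."""
    return float(np.mean([pi_target[a] / p * r for a, p, r in samples]))

def ips_mixture(samples, loggers, n_per_logger):
    """Use the data-weighted mixture of the logging policies as the propensity."""
    mix = sum(n * pi for n, pi in zip(n_per_logger, loggers)) / sum(n_per_logger)
    return float(np.mean([pi_target[a] / mix[a] * r for a, _, r in samples]))

true_value = float(pi_target @ expected_reward)
print("true value        :", round(true_value, 3))
print("naive pooled IPS  :", round(ips_naive(samples), 3))
print("mixture-prop. IPS :", round(ips_mixture(samples, loggers, n_per_logger), 3))
```

Both estimators are unbiased in this setup; the mixture-propensity variant typically shows lower variance when one logging policy diverges strongly from the target, which mirrors the effect described in the abstract.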
Abstract: Implicit feedback (e.g., clicks and dwell times) is an abundant source of data in human-interactive systems. While implicit feedback has many advantages (e.g., it is inexpensive to collect, user-centric, and timely), its inherent biases are a key obstacle to its effective use. For example, position bias in search rankings strongly influences how many clicks a result receives, so that directly using click data as a training signal in Learning-to-Rank (LTR) methods yields sub-optimal results. To overcome this bias problem, we present a counterfactual inference framework that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite biased data. Using this framework, we derive a Propensity-Weighted Ranking SVM for discriminative learning from implicit feedback, where click models take the role of the propensity estimator. In contrast to most conventional approaches, which use click models to de-bias the data, this allows training of ranking functions even in settings where queries do not repeat. Beyond the theoretical support, we show empirically that the proposed learning method is highly effective in dealing with biases, that it is robust to noise and propensity model misspecification, and that it scales efficiently. We also demonstrate the real-world applicability of our approach on an operational search engine, where it substantially improves retrieval performance.
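As a rough illustration of propensity-weighted empirical risk minimization from click data, the sketch below computes an IPS-weighted pairwise hinge loss in which each click is reweighted by the inverse of a simple position-based examination probability. The 1/rank propensity model and the pairwise hinge loss are assumptions for illustration; this is not the paper's Propensity-Weighted Ranking SVM.

```python
import numpy as np

def position_propensity(rank, eta=1.0):
    """Examination probability that decays with presented rank.
    The 1/rank**eta form is an assumption for illustration only."""
    return 1.0 / (rank ** eta)

def ips_weighted_pairwise_loss(w, features, clicked, ranks):
    """IPS-weighted pairwise hinge loss for a single query.

    features : (n_docs, d) document feature matrix
    clicked  : boolean array marking clicked documents
    ranks    : 1-based positions at which documents were presented
    Each clicked document is weighted by 1 / propensity(rank), so clicks
    at rarely examined positions count more, counteracting position bias.
    """
    scores = features @ w
    loss = 0.0
    for i in np.flatnonzero(clicked):
        weight = 1.0 / position_propensity(ranks[i])
        for j in np.flatnonzero(~clicked):
            # hinge loss on the margin between clicked doc i and unclicked doc j
            loss += weight * max(0.0, 1.0 - (scores[i] - scores[j]))
    return loss

# Toy usage with random features and clicks at two positions.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
clicked = np.array([False, True, False, False, True])
ranks = np.arange(1, 6)
w = rng.normal(size=3)
print(ips_weighted_pairwise_loss(w, X, clicked, ranks))
```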
Abstract: Most data for evaluating and training recommender systems is subject to selection biases, either through self-selection by the users or through the actions of the recommendation system itself. In this paper, we provide a principled approach to handling selection biases, adapting models and estimation techniques from causal inference. The approach leads to unbiased performance estimators despite biased data, and to a matrix factorization method that provides substantially improved prediction performance on real-world data. We theoretically and empirically characterize the robustness of the approach, finding that it is highly practical and scalable.
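The kind of unbiased performance estimator referred to above can be sketched as follows: the squared error of each observed rating is reweighted by the inverse of its observation propensity and averaged over all user-item pairs. The toy data, the MSE metric, and the propensity model below are illustrative assumptions.

```python
import numpy as np

def ips_mse(pred, true, observed, propensity):
    """IPS estimate of the mean squared error over ALL user-item pairs,
    computed from the observed entries only, each reweighted by 1/propensity."""
    sq_err = (pred - true) ** 2
    return float(np.sum(observed * sq_err / propensity) / true.size)

rng = np.random.default_rng(2)
true = rng.integers(1, 6, size=(50, 30)).astype(float)     # hypothetical true ratings
pred = true + rng.normal(scale=1.0, size=true.shape)       # some model's predictions

# Self-selection: highly rated items are more likely to be observed.
propensity = np.clip(true / 5.0, 0.1, 1.0)
observed = rng.random(true.shape) < propensity

naive_mse = float(np.mean((pred - true)[observed] ** 2))   # biased by the selection
ips_estimate = ips_mse(pred, true, observed, propensity)
full_mse = float(np.mean((pred - true) ** 2))              # unavailable in practice
print(round(naive_mse, 3), round(ips_estimate, 3), round(full_mse, 3))
```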
Abstract: Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing $k$ systems against a baseline, and ranking $k$ systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half.
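To illustrate the Monte Carlo view of evaluation with sampled judgments, the sketch below estimates a metric that decomposes into per-document contributions (a DCG-like sum is assumed here) by judging only a random sample of documents and reweighting each judged contribution by its inverse sampling probability. The sampling distribution shown is a heuristic placeholder, not one of the variance-optimizing distributions derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Suppose the metric of one system decomposes into per-document contributions,
# e.g. a DCG-like sum: contribution_i = relevance_i / log2(rank_i + 1).
n_docs = 1000
relevance = (rng.random(n_docs) < 0.1).astype(float)   # unknown until judged
ranks = np.arange(1, n_docs + 1)
contribution = relevance / np.log2(ranks + 1)
true_metric = contribution.sum()                       # would need all judgments

# Sampling distribution over documents to judge (heuristic: favor top ranks).
q = 1.0 / np.log2(ranks + 1)
q = q / q.sum()

# Judge only a sample of m documents, drawn i.i.d. from q.
m = 100
judged = rng.choice(n_docs, size=m, p=q, replace=True)

# Monte Carlo / importance-sampling estimate: unbiased for the full sum,
# since E[ f(I) / q(I) ] = sum_i f(i) when I ~ q.
estimate = float(np.mean(contribution[judged] / q[judged]))
print(round(true_metric, 2), round(estimate, 2))
```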
Abstract: We propose online unsupervised domain adaptation (DA), which is performed incrementally as data comes in and is applicable when batch DA is not possible. In a part-of-speech (POS) tagging evaluation, we find that online unsupervised DA performs as well as batch DA.
Abstract: In this paper, we study shortlists as an interface component for recommender systems, with the dual goal of supporting the user's decision process and improving implicit feedback elicitation for increased recommendation quality. A shortlist is a temporary list of candidates that the user is currently considering, e.g., a list of a few movies the user is considering for viewing. From a cognitive perspective, shortlists serve as digital short-term memory where users can off-load the items under consideration -- thereby decreasing their cognitive load. From a machine learning perspective, adding items to the shortlist generates a new implicit feedback signal as a by-product of exploration and decision making, which can improve recommendation quality. Shortlisting therefore provides additional data for training recommendation systems without the increase in cognitive load that requesting explicit feedback would incur. We perform a user study with a movie recommendation setup to compare interfaces that offer shortlist support with those that do not. From the user study we conclude: (i) users make better decisions with a shortlist; (ii) users prefer an interface with shortlist support; and (iii) the additional implicit feedback from sessions with a shortlist improves the quality of recommendations by nearly a factor of two.
Abstract: In a recent paper, Levy and Goldberg pointed out an interesting connection between prediction-based word embedding models and count models based on pointwise mutual information. Under certain conditions, they showed that both models end up optimizing equivalent objective functions. This paper explores this connection in more detail and lays out the factors leading to differences between these models. We find that the most relevant differences from an optimization perspective are that (i) predict models work in a low-dimensional space where embedding vectors can interact heavily, and (ii) since predict models have fewer parameters, they are less prone to overfitting. Motivated by the insights of our analysis, we show how count models can be regularized in a principled manner and provide closed-form solutions for L1 and L2 regularization. Finally, we propose a new embedding model with a convex objective and the additional benefit of being intelligible.
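As a loose illustration of elementwise regularization of a count model (not a reproduction of the paper's derivation), the sketch below builds a PPMI matrix from co-occurrence counts and applies the closed-form solutions of elementwise L2-penalized shrinkage and L1-penalized soft-thresholding; whether these coincide with the paper's closed forms is not asserted.

```python
import numpy as np

def ppmi_matrix(cooc):
    """Positive pointwise mutual information from a word-context count matrix."""
    total = cooc.sum()
    p_word = cooc.sum(axis=1, keepdims=True) / total
    p_ctx = cooc.sum(axis=0, keepdims=True) / total
    p_joint = cooc / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_word * p_ctx))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def l2_shrink(M, lam):
    """Closed form of min_X 0.5*||X - M||^2 + 0.5*lam*||X||^2: uniform shrinkage."""
    return M / (1.0 + lam)

def l1_soft_threshold(M, lam):
    """Closed form of min_X 0.5*||X - M||^2 + lam*||X||_1: elementwise
    soft-thresholding, which zeroes small PMI values and sparsifies the matrix."""
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

cooc = np.array([[10.0, 2.0, 0.0],
                 [ 2.0, 8.0, 1.0],
                 [ 0.0, 1.0, 5.0]])
M = ppmi_matrix(cooc)
print(l2_shrink(M, 0.5))
print(l1_soft_threshold(M, 0.2))
```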