We study the problem of learning classifiers with a fairness constraint, with three main contributions towards the goal of quantifying the problem's inherent tradeoffs. First, we relate two existing fairness measures to cost-sensitive risks. Second, we show that for cost-sensitive classification and fairness measures, the optimal classifier is an instance-dependent thresholding of the class-probability function. Third, we show how the tradeoff between accuracy and fairness is determined by the alignment between the class-probabilities for the target and sensitive features. Underpinning our analysis is a general framework that casts the problem of learning with a fairness requirement as one of minimising the difference of two statistical risks.
Bregman divergences play a central role in the design and analysis of a range of machine learning algorithms. This paper explores the use of Bregman divergences to establish reductions between such algorithms and their analyses. We present a new scaled isodistortion theorem involving Bregman divergences (scaled Bregman theorem for short) which shows that certain "Bregman distortions'" (employing a potentially non-convex generator) may be exactly re-written as a scaled Bregman divergence computed over transformed data. Admissible distortions include geodesic distances on curved manifolds and projections or gauge-normalisation, while admissible data include scalars, vectors and matrices. Our theorem allows one to leverage to the wealth and convenience of Bregman divergences when analysing algorithms relying on the aforementioned Bregman distortions. We illustrate this with three novel applications of our theorem: a reduction from multi-class density ratio to class-probability estimation, a new adaptive projection free yet norm-enforcing dual norm mirror descent algorithm, and a reduction from clustering on flat manifolds to clustering on curved manifolds. Experiments on each of these domains validate the analyses and suggest that the scaled Bregman theorem might be a worthy addition to the popular handful of Bregman divergence properties that have been pervasive in machine learning.
Suppose we have a sample of instances paired with binary labels corrupted by arbitrary instance- and label-dependent noise. With sufficiently many such samples, can we optimally classify and rank instances with respect to the noise-free distribution? We provide a theoretical analysis of this question, with three main contributions. First, we prove that for instance-dependent noise, any algorithm that is consistent for classification on the noisy distribution is also consistent on the clean distribution. Second, we prove that for a broad class of instance- and label-dependent noise, a similar consistency result holds for the area under the ROC curve. Third, for the latter noise model, when the noise-free class-probability function belongs to the generalised linear model family, we show that the Isotron can efficiently and provably learn from the corrupted sample.
Many classification algorithms produce a classifier that is a weighted average of kernel evaluations. When working with a high or infinite dimensional kernel, it is imperative for speed of evaluation and storage issues that as few training samples as possible are used in the kernel expansion. Popular existing approaches focus on altering standard learning algorithms, such as the Support Vector Machine, to induce sparsity, as well as post-hoc procedures for sparse approximations. Here we adopt the latter approach. We begin with a very simple classifier, given by the kernel mean $$ f(x) = \frac{1}{n} \sum\limits_{i=i}^{n} y_i K(x_i,x) $$ We then find a sparse approximation to this kernel mean via herding. The result is an accurate, easily parallelized algorithm for learning classifiers.
Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2010] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2010] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the SLN-robustness of the unhinged loss.
In Programming by Example, a system attempts to infer a program from input and output examples, generally by searching for a composition of certain base functions. Performing a naive brute force search is infeasible for even mildly involved tasks. We note that the examples themselves often present clues as to which functions to compose, and how to rank the resulting programs. In text processing, which is our domain of interest, clues arise from simple textual features: for example, if parts of the input and output strings are permutations of one another, this suggests that sorting may be useful. We describe a system that learns the reliability of such clues, allowing for faster search and a principled ranking over programs. Experiments on a prototype of this system show that this learning scheme facilitates efficient inference on a range of text processing tasks.
In dyadic prediction, labels must be predicted for pairs (dyads) whose members possess unique identifiers and, sometimes, additional features called side-information. Special cases of this problem include collaborative filtering and link prediction. We present the first model for dyadic prediction that satisfies several important desiderata: (i) labels may be ordinal or nominal, (ii) side-information can be easily exploited if present, (iii) with or without side-information, latent features are inferred for dyad members, (iv) it is resistant to sample-selection bias, (v) it can learn well-calibrated probabilities, and (vi) it can scale to very large datasets. To our knowledge, no existing method satisfies all the above criteria. In particular, many methods assume that the labels are ordinal and ignore side-information when it is present. Experimental results show that the new method is competitive with state-of-the-art methods for the special cases of collaborative filtering and link prediction, and that it makes accurate predictions on nominal data.