Heinrich Jiang

Faster DBSCAN via subsampled similarity queries

Jun 11, 2020
Heinrich Jiang, Jennifer Jang, Jakub Łącki

DBSCAN is a popular density-based clustering algorithm. It computes the $\epsilon$-neighborhood graph of a dataset and uses the connected components of the high-degree nodes to decide the clusters. However, the full neighborhood graph may be too costly to compute, with a worst-case complexity of $O(n^2)$. In this paper, we propose a simple variant called SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph, requires only access to similarity queries for pairs of points, and in particular avoids any complex data structures that need the embeddings of the data points themselves. The runtime of the procedure is $O(sn^2)$, where $s$ is the sampling rate. We show under some natural theoretical assumptions that $s \approx \log n/n$ is sufficient for statistical cluster recovery guarantees, leading to an $O(n\log n)$ complexity. We provide an extensive experimental analysis showing that on large datasets, one can subsample as little as $0.1\%$ of the neighborhood graph, yielding in the best cases over 200x speedup and 250x reduction in RAM consumption compared to scikit-learn's implementation of DBSCAN, while still maintaining competitive clustering performance.
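
The subsampling idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance as the similarity query, applies `min_pts` to degrees in the subsampled graph, and attaches non-core points to any adjacent core point. The name `sng_dbscan` and all of its parameters are hypothetical.

```python
import math
import random


def sng_dbscan(points, eps, min_pts, s, seed=0):
    """Cluster using a subsampled eps-neighborhood graph (illustrative sketch).

    Returns one label per point; -1 marks points left unclustered.
    """
    rng = random.Random(seed)
    n = len(points)
    m = min(n, max(1, math.ceil(s * n)))  # sampled partners per point
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        # Only about s * n pairwise queries per point, so O(s * n^2) overall.
        for j in rng.sample(range(n), m):
            if j != i and math.dist(points[i], points[j]) <= eps:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Assumption: a point is "core" if its subsampled degree reaches min_pts.
    core = [len(neighbors[i]) >= min_pts for i in range(n)]

    # Union-find over edges between core points.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    parent[find(i)] = find(j)

    labels = [-1] * n
    ids = {}
    for i in range(n):
        if core[i]:
            labels[i] = ids.setdefault(find(i), len(ids))
    # Assumption: a non-core point joins the cluster of any adjacent core point.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels
```

With `s = 1.0` every pair is queried and the sketch reduces to clustering on the full neighborhood graph; smaller values of `s` trade accuracy for fewer distance computations.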

Learning the Truth From Only One Side of the Story

Jun 08, 2020
Heinrich Jiang, Qijia Jiang, Aldo Pacchiano

Learning under one-sided feedback (i.e., where examples arrive in an online fashion and the learner only sees the labels for examples it predicted positively on) is a fundamental problem in machine learning; applications include lending and recommendation systems. Despite this, there has been surprisingly little progress in mitigating the effects of the sampling bias that arises. We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge sub-optimally or even fail to converge to the optimal solution. We propose an adaptive Upper Confidence Bound approach that comes with rigorous regret guarantees and we show that it outperforms several existing methods experimentally. Our method leverages uncertainty estimation techniques for generalized linear models to explore uncertain areas more efficiently than existing approaches, which explore randomly.

Deep k-NN for Noisy Labels

Apr 26, 2020
Dara Bahri, Heinrich Jiang, Maya Gupta

Modern machine learning models are often trained on examples with noisy labels that hurt performance and are hard to identify. In this paper, we provide an empirical study showing that a simple $k$-nearest neighbor-based filtering approach on the logit layer of a preliminary model can remove mislabeled training data and produce more accurate models than many recently proposed methods. We also provide new statistical guarantees for its efficacy.

* Full paper (including supplemental) can be found at https://github.com/dbahri/deepknn
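
The filtering step described in this abstract might look something like the sketch below. It assumes the logit vectors of a preliminary model are already available, uses Euclidean distance, excludes each point from its own neighbor set, and keeps a point only when its label matches the plurality label of its neighbors. The name `knn_filter` and these choices are illustrative assumptions rather than the paper's exact procedure.

```python
import math
from collections import Counter


def knn_filter(logits, labels, k):
    """Return indices of examples whose label agrees with the plurality
    label of their k nearest neighbors in logit space (illustrative sketch)."""
    keep = []
    for i, z in enumerate(logits):
        # Rank all other examples by distance to example i.
        order = sorted(
            (j for j in range(len(logits)) if j != i),
            key=lambda j: math.dist(z, logits[j]),
        )
        votes = Counter(labels[j] for j in order[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep
```

A final model would then be retrained on only the kept indices.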

Robustness Guarantees for Mode Estimation with an Application to Bandits

Mar 05, 2020
Aldo Pacchiano, Heinrich Jiang, Michael I. Jordan

Mode estimation is a classical problem in statistics with a wide range of applications in machine learning. Despite this, there is little understanding of its robustness properties under possibly adversarial data contamination. In this paper, we give precise robustness guarantees as well as privacy guarantees under simple randomization. We then introduce a theory for multi-armed bandits where the values are the modes of the reward distributions rather than the means. We prove regret guarantees for the problems of top arm identification, top m-arms identification, contextual modal bandits, and infinite continuous arms top arm recovery. We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences, thus rendering modal bandits an attractive choice in situations where the rewards may have outliers or adversarial corruptions.

* 12 pages, 7 figures, 14 appendix pages
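
To illustrate why ranking arms by mode rather than mean can be robust to outliers, here is a small sketch. It assumes one-dimensional rewards and uses a simple $k$-nearest-neighbor mode estimate (the sample whose $k$-th nearest neighbor is closest); this estimator and the names `knn_mode` and `top_arm_by_mode` are assumptions chosen for illustration, not necessarily the estimator analyzed in the paper.

```python
import math


def knn_mode(samples, k):
    """Return the sample with the smallest k-th nearest-neighbor distance,
    i.e. the sample sitting in the densest region. Requires k < len(samples)."""
    best_x, best_r = None, math.inf
    for x in samples:
        dists = sorted(abs(x - y) for y in samples)
        r = dists[k]  # dists[0] is the distance from x to itself
        if r < best_r:
            best_x, best_r = x, r
    return best_x


def top_arm_by_mode(arms, k):
    """Return the index of the arm whose estimated reward mode is largest."""
    modes = [knn_mode(rewards, k) for rewards in arms]
    return max(range(len(arms)), key=lambda a: modes[a])
```

In a toy case where one arm's rewards cluster near 1 with a single large outlier and another arm's rewards cluster near 2 with a single large negative outlier, the sample means favor the first arm while the estimated modes favor the second.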

Group-based Fair Learning Leads to Counter-intuitive Predictions

Oct 04, 2019
Ofir Nachum, Heinrich Jiang

A number of machine learning (ML) methods have been proposed recently to maximize model predictive accuracy while enforcing notions of group parity or fairness across sub-populations. We propose a desirable property for these procedures, slack-consistency: For any individual, the predictions of the model should be monotonic with respect to allowed slack (i.e., maximum allowed group-parity violation). Such monotonicity can be useful for individuals to understand the impact of enforcing fairness on their predictions. Surprisingly, we find that standard ML methods for enforcing fairness violate this basic property. Moreover, this undesirable behavior arises in situations agnostic to the complexity of the underlying model or approximate optimizations, suggesting that the simple act of incorporating a constraint can lead to drastically unintended behavior in ML. We present a simple theoretical method for enforcing slack-consistency, while encouraging further discussions on the unintended behaviors potentially induced when enforcing group-based parity.
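
The slack-consistency property defined above can be checked for a single individual with a few lines of code. The sketch below assumes that a fair model has already been trained at several slack values and that the individual's prediction at each slack is available; the function name `is_slack_consistent` and its input format are hypothetical.

```python
def is_slack_consistent(preds_by_slack):
    """Check monotonicity of one individual's predictions in the allowed slack.

    preds_by_slack: list of (slack, prediction) pairs for ONE individual.
    Returns True if the predictions are non-decreasing or non-increasing
    when ordered by slack.
    """
    ordered = [pred for _, pred in sorted(preds_by_slack)]
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)
```

For example, predictions of 0.3, 0.4, 0.5 at slacks 0.0, 0.1, 0.2 pass the check, while 0.3, 0.6, 0.4 do not.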

Wasserstein Fair Classification

Jul 28, 2019
Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, Silvia Chiappa

We propose an approach to fair classification that enforces independence between the classifier outputs and sensitive information by minimizing Wasserstein-1 distances. The approach has desirable theoretical properties and is robust to specific choices of the threshold used to obtain class predictions from model outputs. We introduce different methods that enable hiding sensitive information at test time or have a simple and fast implementation. We show empirical performance against different fairness baselines on several benchmark fairness datasets.

* Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019
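
As a rough illustration of the quantity involved, the sketch below computes the Wasserstein-1 distance between two one-dimensional empirical distributions, for example classifier scores for two sensitive groups. It only measures the distance and does not implement the paper's training procedure; the name `wasserstein1` is an assumption. It relies on the standard identity that in one dimension the distance equals the integral of the absolute difference between the two empirical CDFs.

```python
from bisect import bisect_right


def wasserstein1(a, b):
    """Wasserstein-1 distance between two 1-D empirical distributions."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    total = 0.0
    for lo, hi in zip(grid, grid[1:]):
        # Empirical CDF values are constant on [lo, hi).
        cdf_a = bisect_right(a, lo) / len(a)
        cdf_b = bisect_right(b, lo) / len(b)
        total += abs(cdf_a - cdf_b) * (hi - lo)
    return total
```

A value of zero means the two groups have identical score distributions, so any decision threshold treats them alike, which relates to the threshold robustness mentioned in the abstract.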

Minimum-Margin Active Learning

May 31, 2019
Heinrich Jiang, Maya Gupta

We present a new active sampling method we call min-margin which trains multiple learners on bootstrap samples and then chooses the examples to label based on the candidates' minimum margin amongst the bootstrapped models. This extends standard margin sampling in a way that increases its diversity in a supervised manner as it arises from the model uncertainty. We focus on the one-shot batch active learning setting, and show theoretically and through extensive experiments on a broad set of problems that min-margin outperforms other methods, particularly as batch size grows.
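
The selection rule can be sketched as below. To keep the example self-contained it swaps in a nearest-centroid classifier as the base learner and defines the margin as the gap between the two smallest centroid distances; both are stand-ins chosen for illustration, and the names `fit_centroids`, `margin`, and `min_margin_select` are hypothetical.

```python
import math
import random


def fit_centroids(X, y):
    """Fit a nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for x, c in zip(X, y):
        acc = sums.setdefault(c, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def margin(centroids, x):
    """Gap between the two closest class centroids (small = uncertain)."""
    d = sorted(math.dist(x, mu) for mu in centroids.values())
    return d[1] - d[0] if len(d) > 1 else math.inf


def min_margin_select(X, y, pool, batch_size, n_models=10, seed=0):
    """Pick pool indices with the smallest margin across bootstrapped models."""
    rng = random.Random(seed)
    n = len(X)
    scores = [math.inf] * len(pool)
    for _ in range(n_models):
        # Bootstrap: resample the labeled set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_centroids([X[i] for i in idx], [y[i] for i in idx])
        for p, x in enumerate(pool):
            scores[p] = min(scores[p], margin(model, x))
    ranked = sorted(range(len(pool)), key=lambda p: scores[p])
    return ranked[:batch_size]
```

Taking the minimum over bootstrapped models means a candidate is selected if any one of the models finds it ambiguous, which is the source of the added diversity relative to single-model margin sampling.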

Identifying and Correcting Label Bias in Machine Learning

Jan 15, 2019
Heinrich Jiang, Ofir Nachum

Datasets often contain biases which unfairly disadvantage certain groups, and classifiers trained on such datasets can inherit these biases. In this paper, we provide a mathematical formulation of how this bias can arise. We do so by assuming the existence of underlying, unknown, and unbiased labels which are overwritten by an agent who intends to provide accurate labels but may have biases against certain groups. Despite the fact that we only observe the biased labels, we are able to show that the bias may nevertheless be corrected by re-weighting the data points without changing the labels. We show, with theoretical guarantees, that training on the re-weighted dataset corresponds to training on the unobserved but unbiased labels, thus leading to an unbiased machine learning classifier. Our procedure is fast and robust and can be used with virtually any learning algorithm. We evaluate on a number of standard machine learning fairness datasets and a variety of fairness notions, finding that our method outperforms standard approaches in achieving fair classification.

Non-Asymptotic Uniform Rates of Consistency for k-NN Regression

Nov 03, 2018
Heinrich Jiang

We derive high-probability finite-sample uniform rates of consistency for $k$-NN regression that are optimal up to logarithmic factors under mild assumptions. We moreover show that $k$-NN regression adapts to an unknown lower intrinsic dimension automatically. We then apply the $k$-NN regression rates to establish new results about estimating the level sets and global maxima of a function from noisy observations.

* In Proceedings of 33rd AAAI Conference on Artificial Intelligence (AAAI 2019)
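
For reference, the $k$-NN regression estimator that these rates concern is simply the average response of the $k$ nearest training points. A minimal sketch, assuming Euclidean distance and the hypothetical name `knn_regress`:

```python
import math


def knn_regress(X, y, query, k):
    """Predict at `query` by averaging y over the k nearest training points."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))
    nearest = order[:k]
    return sum(y[i] for i in nearest) / len(nearest)
```

Evaluating this estimate over a grid of query points and taking the maximizer, or thresholding it, gives plug-in estimates of the global maximum and level sets mentioned in the abstract.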

DBSCAN++: Towards fast and scalable density clustering

Oct 31, 2018
Jennifer Jang, Heinrich Jiang

DBSCAN is a classical density-based clustering procedure which has had tremendous practical relevance. However, it implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which may be too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a subset of the points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness to the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.
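
The idea of computing densities for only a subset of points can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the subset is drawn uniformly at random, distances are Euclidean, a point counts itself in its own neighborhood, and every point is assigned to the cluster of its closest core point when that core point lies within `eps`. The name `dbscan_pp` and its parameters are hypothetical.

```python
import math
import random


def dbscan_pp(points, eps, min_pts, m, seed=0):
    """Density clustering with densities computed for only m sampled points.

    Returns one label per point; -1 marks noise.
    """
    rng = random.Random(seed)
    n = len(points)
    subset = rng.sample(range(n), min(m, n))
    # Density is evaluated only for the sampled points: O(m * n) distances.
    core = [
        i for i in subset
        if sum(math.dist(points[i], q) <= eps for q in points) >= min_pts
    ]

    # Connected components of the eps-graph restricted to core points.
    comp = {}
    next_id = 0
    for c in core:
        if c in comp:
            continue
        comp[c] = next_id
        stack = [c]
        while stack:
            u = stack.pop()
            for v in core:
                if v not in comp and math.dist(points[u], points[v]) <= eps:
                    comp[v] = next_id
                    stack.append(v)
        next_id += 1

    # Assign each point to its closest core point's cluster if within eps.
    labels = [-1] * n
    for i, p in enumerate(points):
        if core:
            nearest = min(core, key=lambda c: math.dist(p, points[c]))
            if math.dist(p, points[nearest]) <= eps:
                labels[i] = comp[nearest]
    return labels
```

Setting `m` equal to the dataset size makes every point a density candidate; smaller `m` reduces the number of distance computations at the risk of missing sparse clusters.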
