Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a dataset as points in a metric space and compute distances to group together similar points. In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data: by training neural networks to perform instance segmentation on plotted data. Our approach, Visual Clustering, has several advantages over traditional clustering algorithms: it is much faster than most existing clustering algorithms (making it suitable for very large datasets), it agrees strongly with human intuition for clusters, and it is by default hyperparameter free (although additional steps with hyperparameters can be introduced for more control of the algorithm). We describe the method and compare it to ten other clustering methods on synthetic data to illustrate its advantages and disadvantages. We then demonstrate how our approach can be extended to higher dimensional data and illustrate its performance on real-world data. The implementation of Visual Clustering is publicly available and can be applied to any dataset in a few lines of code.
Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of why contrastive learning works is still limited. In this paper, (i) we provably show that contrastive learning outperforms autoencoder, a classical unsupervised learning method, for both feature recovery and downstream tasks; (ii) we also illustrate the role of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations in the in-domain downstream task, but it can harm the performance in transfer learning. We verify our theory with numerical experiments.
Machine learning (ML) prediction APIs are increasingly widely used. An ML API can change over time due to model updates or retraining. This presents a key challenge in the usage of the API because it is often not clear to the user if and how the ML model has changed. Model shifts can affect downstream application performance and also create oversight issues (e.g. if consistency is desired). In this paper, we initiate a systematic investigation of ML API shifts. We first quantify the performance shifts from 2020 to 2021 of popular ML APIs from Google, Microsoft, Amazon, and others on a variety of datasets. We identified significant model shifts in 12 out of 36 cases we investigated. Interestingly, we found several datasets where the API's predictions became significantly worse over time. This motivated us to formulate the API shift assessment problem at a more fine-grained level as estimating how the API model's confusion matrix changes over time when the data distribution is constant. Monitoring confusion matrix shifts using standard random sampling can require a large number of samples, which is expensive as each API call costs a fee. We propose a principled adaptive sampling algorithm, MASA, to efficiently estimate confusion matrix shifts. MASA can accurately estimate the confusion matrix shifts in commercial ML APIs using up to 90% fewer samples compared to random sampling. This work establishes ML API shifts as an important problem to study and provides a cost-effective approach to monitor such shifts.
In many applications of AI, the algorithm's output is framed as a suggestion to a human user. The user may ignore the advice or take it into consideration to modify his/her decisions. With the increasing prevalence of such human-AI interactions, it is important to understand how users act (or do not act) upon AI advice, and how users regard advice differently if they believe the advice come from an "AI" versus another human. In this paper, we characterize how humans use AI suggestions relative to equivalent suggestions from a group of peer humans across several experimental settings. We find that participants' beliefs about the human versus AI performance on a given task affects whether or not they heed the advice. When participants decide to use the advice, they do so similarly for human and AI suggestions. These results provide insights into factors that affect human-AI interactions.
Understanding and explaining the mistakes made by trained models is critical to many machine learning objectives, such as improving robustness, addressing concept drift, and mitigating biases. However, this is often an ad hoc process that involves manually looking at the model's mistakes on many test samples and guessing at the underlying reasons for those incorrect predictions. In this paper, we propose a systematic approach, conceptual explanation scores (CES), that explains why a classifier makes a mistake on a particular test sample(s) in terms of human-understandable concepts (e.g. this zebra is misclassified as a dog because of faint stripes). We base CES on two prior ideas: counterfactual explanations and concept activation vectors, and validate our approach on well-known pretrained models, showing that it explains the models' mistakes meaningfully. We also train new models with intentional and known spurious correlations, which CES successfully identifies from a single misclassified test sample. The code for CES is publicly available and can easily be applied to new models.
Robust training methods against perturbations to the input data have received great attention in the machine learning literature. A standard approach in this direction is adversarial training which learns a model using adversarially-perturbed training samples. However, adversarial training performs suboptimally against perturbations structured across samples such as universal and group-sparse shifts that are commonly present in biological data such as gene expression levels of different tissues. In this work, we seek to close this optimality gap and introduce Group-Structured Adversarial Training (GSAT) which learns a model robust to perturbations structured across samples. We formulate GSAT as a non-convex concave minimax optimization problem which minimizes a group-structured optimal transport cost. Specifically, we focus on the applications of GSAT for group-sparse and rank-constrained perturbations modeled using group and nuclear norm penalties. In order to solve GSAT's non-smooth optimization problem in those cases, we propose a new minimax optimization algorithm called GDADMM by combining Gradient Descent Ascent (GDA) and Alternating Direction Method of Multipliers (ADMM). We present several applications of the GSAT framework to gain robustness against structured perturbations for image recognition and computational biology datasets.
Transfer learning aims to leverage models pre-trained on source data to efficiently adapt to target setting, where only limited data are available for model fine-tuning. Recent works empirically demonstrate that adversarial training in the source data can improve the ability of models to transfer to new domains. However, why this happens is not known. In this paper, we provide a theoretical model to rigorously analyze how adversarial training helps transfer learning. We show that adversarial training in the source data generates provably better representations, so fine-tuning on top of this representation leads to a more accurate predictor of the target data. We further demonstrate both theoretically and empirically that semi-supervised learning in the source data can also improve transfer learning by similarly improving the representation. Moreover, performing adversarial training on top of semi-supervised learning can further improve transferability, suggesting that the two approaches have complementary benefits on representations. We support our theories with experiments on popular data sets and deep learning architectures.
Post-deployment monitoring of the performance of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled features and a small amount of on-demand labeled examples over time to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon substantially outperforms existing monitoring approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal up to logarithmic factors and is provably robust against broad distribution drifts whereas prior approaches are not.
Annotating the right set of data amongst all available data points is a key challenge in many machine learning applications. Batch active learning is a popular approach to address this, in which batches of unlabeled data points are selected for annotation, while an underlying learning algorithm gets subsequently updated. Increasingly larger batches are particularly appealing in settings where data can be annotated in parallel, and model training is computationally expensive. A key challenge here is scale - typical active learning methods rely on diversity techniques, which select a diverse set of data points to annotate, from an unlabeled pool. In this work, we introduce Active Data Shapley (ADS) -- a filtering layer for batch active learning that significantly increases the efficiency of active learning by pre-selecting, using a linear time computation, the highest-value points from an unlabeled dataset. Using the notion of the Shapley value of data, our method estimates the value of unlabeled data points with regards to the prediction task at hand. We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats: noise, heterogeneity, and domain shift. We run experiments demonstrating that when ADS is used to pre-select the highest-ranking portion of an unlabeled dataset, the efficiency of state-of-the-art batch active learning methods increases by an average factor of 6x, while preserving performance effectiveness.