Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Alex Beutel

Tony

Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Feb 02, 2023

Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, Yao Qin

Abstract:``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of the effectiveness robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models exhibited when considering only one ID dataset, while the gains diminish under our evaluation.

Via

Access Paper or Ask Questions

Striving for data-model efficiency: Identifying data externalities on group performance

Nov 11, 2022

Esther Rolf, Ben Packer, Alex Beutel, Fernando Diaz

Figure 1 for Striving for data-model efficiency: Identifying data externalities on group performance

Figure 2 for Striving for data-model efficiency: Identifying data externalities on group performance

Figure 3 for Striving for data-model efficiency: Identifying data externalities on group performance

Abstract:Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on conditions between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions. From a broader perspective, our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.

* 9 pages, 3 figures. Trustworthy and Socially Responsible Machine Learning (TSRML 2022) workshop co-located with NeurIPS 2022

Via

Access Paper or Ask Questions

A Human-ML Collaboration Framework for Improving Video Content Reviews

Oct 18, 2022

Meghana Deodhar, Xiao Ma, Yixin Cai, Alex Koes, Alex Beutel, Jilin Chen

Figure 1 for A Human-ML Collaboration Framework for Improving Video Content Reviews

Figure 2 for A Human-ML Collaboration Framework for Improving Video Content Reviews

Figure 3 for A Human-ML Collaboration Framework for Improving Video Content Reviews

Figure 4 for A Human-ML Collaboration Framework for Improving Video Content Reviews

Abstract:We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High quality human labeling is critical for enforcement in content moderation. This is challenging due to the problem of information overload - raters need to apply a large taxonomy of granular policy violations with ambiguous definitions, within a limited review duration to relatively long videos. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting - human labels are used to train segment-level models, the predictions of which are displayed as "hints" to human raters, indicating probable regions of the video with specific policy violations. The human verified/corrected segment labels can help refine the model further, hence creating a human-ML positive feedback loop. Experiments show improved human video moderation decision quality, and efficiency through more granular annotations submitted within a similar review duration, which enable a 5-8% AUC improvement in the hint generation models.

* 5 pages, Human-in-the-Loop Data Curation Workshop CIKM'22

Via

Access Paper or Ask Questions

Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations

Oct 14, 2022

Flavien Prost, Ben Packer, Jilin Chen, Li Wei, Pierre Kremp, Nicholas Blumm, Susan Wang, Tulsee Doshi, Tonia Osadebe, Lukasz Heldt(+2 more)

Figure 1 for Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations

Figure 2 for Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations

Figure 3 for Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations

Figure 4 for Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations

Abstract:There has been a flurry of research in recent years on notions of fairness in ranking and recommender systems, particularly on how to evaluate if a recommender allocates exposure equally across groups of relevant items (also known as provider fairness). While this research has laid an important foundation, it gave rise to different approaches depending on whether relevant items are compared per-user/per-query or aggregated across users. Despite both being established and intuitive, we discover that these two notions can lead to opposite conclusions, a form of Simpson's Paradox. We reconcile these notions and show that the tension is due to differences in distributions of users where items are relevant, and break down the important factors of the user's recommendations. Based on this new understanding, practitioners might be interested in either notions, but might face challenges with the per-user metric due to partial observability of the relevance and user satisfaction, typical in real-world recommenders. We describe a technique based on distribution matching to estimate it in such a scenario. We demonstrate on simulated and real-world recommender data the effectiveness and usefulness of such an approach.

Via

Access Paper or Ask Questions

Flexible text generation for counterfactual fairness probing

Jun 28, 2022

Zee Fryer, Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, Kellie Webster

Figure 1 for Flexible text generation for counterfactual fairness probing

Figure 2 for Flexible text generation for counterfactual fairness probing

Figure 3 for Flexible text generation for counterfactual fairness probing

Figure 4 for Flexible text generation for counterfactual fairness probing

Abstract:A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals: does the classifier output change if a sensitive attribute in the input is changed? Existing counterfactual generation methods typically rely on wordlists or templates, producing simple counterfactuals that don't take into account grammar, context, or subtle sensitive attribute references, and could miss issues that the wordlist creators had not considered. In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to make progress on this task. We show that this LLM-based method can produce complex counterfactuals that existing methods cannot, comparing the performance of various counterfactual generation methods on the Civil Comments dataset and showing their value in evaluating a toxicity classifier.

Via

Access Paper or Ask Questions

Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Oct 15, 2021

Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

Figure 1 for Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Figure 2 for Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Figure 3 for Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Figure 4 for Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

Abstract:We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks. Furthermore, we find our patch-based negative augmentation are complementary to traditional (positive) data augmentation, and together boost the performance further. All the code in this work will be open-sourced.

Via

Access Paper or Ask Questions

Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

Jun 04, 2021

Yuyan Wang, Xuezhi Wang, Alex Beutel, Flavien Prost, Jilin Chen, Ed H. Chi

Figure 1 for Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

Figure 2 for Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

Figure 3 for Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

Figure 4 for Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

Abstract:As multi-task models gain popularity in a wider range of machine learning applications, it is becoming increasingly important for practitioners to understand the fairness implications associated with those models. Most existing fairness literature focuses on learning a single task more fairly, while how ML fairness interacts with multiple tasks in the joint learning setting is largely under-explored. In this paper, we are concerned with how group fairness (e.g., equal opportunity, equalized odds) as an ML fairness concept plays out in the multi-task scenario. In multi-task learning, several tasks are learned jointly to exploit task correlations for a more efficient inductive transfer. This presents a multi-dimensional Pareto frontier on (1) the trade-off between group fairness and accuracy with respect to each task, as well as (2) the trade-offs across multiple tasks. We aim to provide a deeper understanding on how group fairness interacts with accuracy in multi-task learning, and we show that traditional approaches that mainly focus on optimizing the Pareto frontier of multi-task accuracy might not perform well on fairness goals. We propose a new set of metrics to better capture the multi-dimensional Pareto frontier of fairness-accuracy trade-offs uniquely presented in a multi-task learning setting. We further propose a Multi-Task-Aware Fairness (MTA-F) approach to improve fairness in multi-task learning. Experiments on several real-world datasets demonstrate the effectiveness of our proposed approach.

Via

Access Paper or Ask Questions

Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective

May 20, 2021

Flavien Prost, Pranjal Awasthi, Nick Blumm, Aditee Kumthekar, Trevor Potter, Li Wei, Xuezhi Wang, Ed H. Chi, Jilin Chen, Alex Beutel

Figure 1 for Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective

Figure 2 for Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective

Figure 3 for Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective

Figure 4 for Measuring Model Fairness under Noisy Covariates: A Theoretical Perspective

Abstract:In this work we study the problem of measuring the fairness of a machine learning model under noisy information. Focusing on group fairness metrics, we investigate the particular but common situation when the evaluation requires controlling for the confounding effect of covariate variables. In a practical setting, we might not be able to jointly observe the covariate and group information, and a standard workaround is to then use proxies for one or more of these variables. Prior works have demonstrated the challenges with using a proxy for sensitive attributes, and strong independence assumptions are needed to provide guarantees on the accuracy of the noisy estimates. In contrast, in this work we study using a proxy for the covariate variable and present a theoretical analysis that aims to characterize weaker conditions under which accurate fairness evaluation is possible. Furthermore, our theory identifies potential sources of errors and decouples them into two interpretable parts $\gamma$ and $\epsilon$. The first part $\gamma$ depends solely on the performance of the proxy such as precision and recall, whereas the second part $\epsilon$ captures correlations between all the variables of interest. We show that in many scenarios the error in the estimates is dominated by $\gamma$ via a linear dependence, whereas the dependence on the correlations $\epsilon$ only constitutes a lower order term. As a result we expand the understanding of scenarios where measuring model fairness via proxies can be an effective approach. Finally, we compare, via simulations, the theoretical upper-bounds to the distribution of simulated estimation errors and show that assuming some structure on the data, even weak, is key to significantly improve both theoretical guarantees and empirical results.

Via

Access Paper or Ask Questions

Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

May 06, 2021

Ruohan Zhan, Konstantina Christakopoulou, Ya Le, Jayden Ooi, Martin Mladenov, Alex Beutel, Craig Boutilier, Ed H. Chi, Minmin Chen

Figure 1 for Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

Figure 2 for Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

Figure 3 for Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

Figure 4 for Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

Abstract:Most existing recommender systems focus primarily on matching users to content which maximizes user satisfaction on the platform. It is increasingly obvious, however, that content providers have a critical influence on user satisfaction through content creation, largely determining the content pool available for recommendation. A natural question thus arises: can we design recommenders taking into account the long-term utility of both users and content providers? By doing so, we hope to sustain more providers and a more diverse content pool for long-term user satisfaction. Understanding the full impact of recommendations on both user and provider groups is challenging. This paper aims to serve as a research investigation of one approach toward building a provider-aware recommender, and evaluating its impact in a simulated setup. To characterize the user-recommender-provider interdependence, we complement user modeling by formalizing provider dynamics as well. The resulting joint dynamical system gives rise to a weakly-coupled partially observable Markov decision process driven by recommender actions and user feedback to providers. We then build a REINFORCE recommender agent, coined EcoAgent, to optimize a joint objective of user utility and the counterfactual utility lift of the provider associated with the recommended content, which we show to be equivalent to maximizing overall user utility and the utilities of all providers on the platform under some mild assumptions. To evaluate our approach, we introduce a simulation environment capturing the key interactions among users, providers, and the recommender. We offer a number of simulated experiments that shed light on both the benefits and the limitations of our approach. These results help understand how and when a provider-aware recommender agent is of benefit in building multi-stakeholder recommender systems.

Via

Access Paper or Ask Questions

Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information

Feb 16, 2021

Pranjal Awasthi, Alex Beutel, Matthaeus Kleindessner, Jamie Morgenstern, Xuezhi Wang

Figure 1 for Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information

Figure 2 for Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information

Figure 3 for Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information

Figure 4 for Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information

Abstract:Training and evaluation of fair classifiers is a challenging problem. This is partly due to the fact that most fairness metrics of interest depend on both the sensitive attribute information and label information of the data points. In many scenarios it is not possible to collect large datasets with such information. An alternate approach that is commonly used is to separately train an attribute classifier on data with sensitive attribute information, and then use it later in the ML pipeline to evaluate the bias of a given classifier. While such decoupling helps alleviate the problem of demographic scarcity, it raises several natural questions such as: how should the attribute classifier be trained?, and how should one use a given attribute classifier for accurate bias estimation? In this work we study this question from both theoretical and empirical perspectives. We first experimentally demonstrate that the test accuracy of the attribute classifier is not always correlated with its effectiveness in bias estimation for a downstream model. In order to further investigate this phenomenon, we analyze an idealized theoretical model and characterize the structure of the optimal classifier. Our analysis has surprising and counter-intuitive implications where in certain regimes one might want to distribute the error of the attribute classifier as unevenly as possible among the different subgroups. Based on our analysis we develop heuristics for both training and using attribute classifiers for bias estimation in the data scarce regime. We empirically demonstrate the effectiveness of our approach on real and simulated data.

Via

Access Paper or Ask Questions