Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hanna Wallach

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Jan 26, 2026

Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, Hanna Wallach

Abstract:We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by evidence provided by attack success rate (ASR) comparisons. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.

* NeurIPS 2025

Via

Access Paper or Ask Questions

Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Jun 04, 2025

Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach

Abstract:The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

* Findings of the Association for Computational Linguistics: ACL 2025

Via

Access Paper or Ask Questions

Taxonomizing Representational Harms using Speech Act Theory

Apr 01, 2025

Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Emily Sheng, Dan Vann(+2 more)

Figure 1 for Taxonomizing Representational Harms using Speech Act Theory

Figure 2 for Taxonomizing Representational Harms using Speech Act Theory

Figure 3 for Taxonomizing Representational Harms using Speech Act Theory

Figure 4 for Taxonomizing Representational Harms using Speech Act Theory

Abstract:Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We present a framework, grounded in speech act theory (Austin, 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.

Via

Access Paper or Ask Questions

Toward an Evaluation Science for Generative AI Systems

Mar 07, 2025

Laura Weidinger, Deb Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Sayash Kapoor, Deep Ganguli, Sanmi Koyejo(+1 more)

Abstract:There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems. While generative AI creates unique challenges for system safety engineering and measurement science, the field can draw valuable insights from the development of safety evaluation practices in other fields, including transportation, aerospace, and pharmaceutical engineering. In particular, we present three key lessons: Evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established. Applying these insights, we outline a concrete path toward a more rigorous approach for evaluating generative AI systems.

* First two authors contributed equally to this work

Via

Access Paper or Ask Questions

Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Mar 07, 2025

Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova

Figure 1 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 2 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 3 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Figure 4 for Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Abstract:The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.

Via

Access Paper or Ask Questions

Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Dec 09, 2024

A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Matthew Jagielski, Katja Filippova, Ken Ziyu Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Niloofar Mireshghallah(+25 more)

Figure 1 for Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Figure 2 for Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Figure 3 for Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Figure 4 for Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Abstract:We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model's parameters, e.g., a particular individual's personal data or in-copyright expression of Spiderman that was included in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives.

* Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024)

Via

Access Paper or Ask Questions

A Framework for Evaluating LLMs Under Task Indeterminacy

Nov 21, 2024

Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova

Figure 1 for A Framework for Evaluating LLMs Under Task Indeterminacy

Abstract:Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus. However, some tasks can be ambiguous -- i.e., they provide insufficient information to identify a unique interpretation -- or vague -- i.e., they do not clearly indicate where to draw the line when making a determination. Both ambiguity and vagueness can cause task indeterminacy -- the condition where some items in the evaluation corpus have more than one correct response. In this paper, we develop a framework for evaluating LLMs under task indeterminacy. Our framework disentangles the relationships between task specification, human ratings, and LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a synthetic experiment showing that evaluations that use the "gold label" assumption underestimate the true performance. We also provide a method for estimating an error-adjusted performance interval given partial knowledge about indeterminate items in the evaluation corpus. We conclude by outlining implications of our work for the research community.

* To Appear in NeurIPS 2024 Workshops on Evaluating Evaluations (EvalEval) and Statistical Foundations of LLMs and Foundation Models (SFLLM)

Via

Access Paper or Ask Questions

A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

Oct 26, 2023

Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz(+7 more)

Figure 1 for A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

Figure 2 for A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

Figure 3 for A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

Figure 4 for A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

Abstract:We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles. The framework may be employed alongside domain-specific sociotechnical expertise to create measurements for new harm areas in the future. By implementing this framework, we aim to enable more advanced harm measurement efforts and further the responsible use of LLMs.

* This is a living document

Via

Access Paper or Ask Questions

"One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Oct 23, 2023

Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu

Figure 1 for "One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Figure 2 for "One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Figure 3 for "One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Figure 4 for "One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Abstract:Fairness-related assumptions about what constitutes appropriate NLG system behaviors range from invariance, where systems are expected to respond identically to social groups, to adaptation, where responses should instead vary across them. We design and conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs to illuminate tensions around invariance and adaptation. We outline people's expectations of system behaviors, and surface potential caveats of these two contrasting yet commonly-held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around defining what constitutes fair NLG system behavior.

* 36 pages, 24 figures

Via

Access Paper or Ask Questions

Measuring Representational Harms in Image Captioning

Jun 14, 2022

Angelina Wang, Solon Barocas, Kristen Laird, Hanna Wallach

Figure 1 for Measuring Representational Harms in Image Captioning

Figure 2 for Measuring Representational Harms in Image Captioning

Figure 3 for Measuring Representational Harms in Image Captioning

Figure 4 for Measuring Representational Harms in Image Captioning

Abstract:Previous work has largely considered the fairness of image captioning systems through the underspecified lens of "bias." In contrast, we present a set of techniques for measuring five types of representational harms, as well as the resulting measurements obtained for two of the most popular image captioning datasets using a state-of-the-art image captioning system. Our goal was not to audit this image captioning system, but rather to develop normatively grounded measurement techniques, in turn providing an opportunity to reflect on the many challenges involved. We propose multiple measurement techniques for each type of harm. We argue that by doing so, we are better able to capture the multi-faceted nature of each type of harm, in turn improving the (collective) validity of the resulting measurements. Throughout, we discuss the assumptions underlying our measurement approach and point out when they do not hold.

* ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2022

Via

Access Paper or Ask Questions