Su Lin Blodgett

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Nov 18, 2023
Yu Lu Liu, Meng Cao, Su Lin Blodgett, Jackie Chi Kit Cheung, Alexandra Olteanu, Adam Trischler

AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and other responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization -- a common NLP task largely overlooked by the responsible AI community -- we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020 and 2022. We focus on which responsible AI issues are covered, how and when they are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.

"One-size-fits-all"? Observations and Expectations of NLG Systems Across Identity-Related Language Features

Oct 23, 2023
Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu

Fairness-related assumptions about what constitutes appropriate NLG system behaviors range from invariance, where systems are expected to respond identically to social groups, to adaptation, where responses should instead vary across them. We design and conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs to illuminate tensions around invariance and adaptation. We outline people's expectations of system behaviors, and surface potential caveats of these two contrasting yet commonly-held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around defining what constitutes fair NLG system behavior.

* 36 pages, 24 figures 
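
As a rough illustration of the kind of perturbation these case studies rely on (and not the paper's actual experimental setup), the sketch below swaps a name slot in a generation prompt and collects model continuations for side-by-side comparison; the model checkpoint, template, and names are assumptions chosen for the example.

```python
# Hypothetical sketch: perturb an identity-related name slot in an NLG prompt
# and compare the system's continuations across perturbations.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder checkpoint

TEMPLATE = "My coworker {name} asked me for advice about planning a trip home."
NAMES = ["Emily", "Lakisha", "Juan", "Ming"]  # illustrative perturbation set

for name in NAMES:
    prompt = TEMPLATE.format(name=name)
    out = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    # Identical continuations across names suggest invariance; continuations
    # that vary with the name suggest adaptation.
    print(f"--- {name} ---\n{out}\n")
```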

Evaluating the Social Impact of Generative AI Systems in Systems and Society

Jun 12, 2023
Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, Alexandra Sasha Luccioni, Alberto Lusoli, Margaret Mitchell, Jessica Newman, Marie-Therese Png, Andrew Strait, Apostol Vassilev

Generative AI systems across modalities, spanning text, image, audio, and video, have broad social impacts, but there is no official standard for evaluating those impacts or for deciding which impacts should be evaluated. We move toward a standard approach for evaluating a generative AI system of any modality, in two overarching categories: what can be evaluated in a base system that has no predetermined application and what can be evaluated in society. We describe specific social impact categories and how to approach and conduct evaluations in the base technical system, then in people and society. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to all modalities, and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in society, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm. We are concurrently crafting an evaluation repository for the AI research community to contribute existing evaluations along the given categories. This version will be updated following a CRAFT session at ACM FAccT 2023.

This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models

May 22, 2023
Seraphina Goldfarb-Tarrant, Eddie Ungless, Esma Balkir, Su Lin Blodgett

Bias research in NLP seeks to analyse models for social biases, thus helping NLP practitioners uncover, measure, and mitigate social harms. We analyse the body of work that uses prompts and templates to assess bias in language models. We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure and how that measurement is carried out. By applying this taxonomy to 90 bias tests, we illustrate qualitatively and quantitatively that core aspects of bias test conceptualisations and operationalisations are frequently unstated or ambiguous, carry implicit assumptions, or are mismatched. Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched. We offer guidance to enable the community to explore a wider section of the possible bias space, and to better close the gap between desired outcomes and experimental design, both for bias and for evaluating language models more broadly.

* Accepted to ACL Findings 2023 
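
For readers unfamiliar with the prompt- and template-based tests the paper analyses, the sketch below shows one minimal form such a test can take: filling a group slot in a template and comparing a masked language model's completions. The model, template, groups, and comparison are illustrative assumptions, not any specific test from the paper's taxonomy.

```python
# Hypothetical sketch of a template-based bias probe for a masked language model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
mask = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

TEMPLATE = f"The {{group}} engineer was described as {mask}."
GROUPS = ["young", "old"]  # illustrative contrast pair

for group in GROUPS:
    results = fill_mask(TEMPLATE.format(group=group), top_k=5)
    completions = [(r["token_str"], round(r["score"], 3)) for r in results]
    # Comparing top completions across groups is one common operationalisation
    # of "bias"; the paper asks what such a test is actually measuring.
    print(group, completions)
```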

It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance

May 15, 2023
Arjun Subramonian, Xingdi Yuan, Hal Daumé III, Su Lin Blodgett

Progress in NLP is increasingly measured through benchmarks; hence, contextualizing progress requires understanding when and why practitioners may disagree about the validity of benchmarks. We develop a taxonomy of disagreement, drawing on tools from measurement modeling, and distinguish between two types of disagreement: 1) how tasks are conceptualized and 2) how measurements of model performance are operationalized. To provide evidence for our taxonomy, we conduct a meta-analysis of relevant literature to understand how NLP tasks are conceptualized, as well as a survey of practitioners about their impressions of different factors that affect benchmark validity. Our meta-analysis and survey across eight tasks, ranging from coreference resolution to question answering, uncover that tasks are generally not clearly and consistently conceptualized and benchmarks suffer from operationalization disagreements. These findings support our proposed taxonomy of disagreement. Finally, based on our taxonomy, we present a framework for constructing benchmarks and documenting their limitations.

* Findings of the Association for Computational Linguistics: ACL 2023  

Fairness and Sequential Decision Making: Limits, Lessons, and Opportunities

Jan 13, 2023
Samer B. Nashed, Justin Svegliato, Su Lin Blodgett

As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated. However, various research communities have independently conceptualized these harms, envisioned potential applications, and proposed interventions. The result is a somewhat fractured landscape of literature focused generally on ensuring decision-making algorithms "do the right thing". In this paper, we compare and discuss work across two major subsets of this literature: algorithmic fairness, which focuses primarily on predictive systems, and ethical decision making, which focuses primarily on sequential decision making and planning. We explore how each of these settings has articulated its normative concerns, the viability of different techniques for these different settings, and how ideas from each setting may have utility for the other.

* 10 pages 

Examining Political Rhetoric with Epistemic Stance Detection

Jan 06, 2023
Ankita Gupta, Su Lin Blodgett, Justin H Gross, Brendan O'Connor

Participants in political discourse employ rhetorical strategies -- such as hedging, attributions, or denials -- to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders -- respected allies and opposed bogeymen -- across U.S. political ideologies.

* Forthcoming in Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) at EMNLP 2022 
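
To make the modeling setup more concrete, the sketch below shows one plausible way to frame clause-level, multi-source stance prediction as RoBERTa sequence classification over (belief holder, clause) pairs; the label set, input pairing, and checkpoint are assumptions for illustration, not the paper's released model.

```python
# Hypothetical sketch: clause-level stance classification with a RoBERTa
# sequence classifier over (belief holder, clause) input pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["asserted", "denied", "ambivalent"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)  # classification head is randomly initialized; fine-tuning on labeled clauses is assumed

holder = "the author"
clause = "tax cuts might stimulate growth"
inputs = tokenizer(holder, clause, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])  # arbitrary until the model is fine-tuned
```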

Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications

May 13, 2022
Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, Alexandra Olteanu

There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.

* Camera Ready for NAACL 2022 (Main Conference) 

Risks of AI Foundation Models in Education

Oct 19, 2021
Su Lin Blodgett, Michael Madaio

If the authors of a recent Stanford report (Bommasani et al., 2021) on the opportunities and risks of "foundation models" are to be believed, these models represent a paradigm shift for AI and for the domains in which they will supposedly be used, including education. Although the name is new (and contested (Field, 2021)), the term describes existing types of algorithmic models that are "trained on broad data at scale" and "fine-tuned" (i.e., adapted) for particular downstream tasks, and is intended to encompass large language models such as BERT or GPT-3 and computer vision models such as CLIP. Such technologies have the potential for harm broadly speaking (e.g., Bender et al., 2021), but their use in the educational domain is particularly fraught, despite the potential benefits for learners claimed by the authors. In section 3.3 of the Stanford report, Malik et al. argue that achieving the goal of providing education for all learners requires more efficient computational approaches that can rapidly scale across educational domains and across educational contexts, for which they argue foundation models are uniquely well-suited. However, evidence suggests that not only are foundation models not likely to achieve the stated benefits for learners, but their use may also introduce new risks for harm.

A Survey of Race, Racism, and Anti-Racism in NLP

Jul 15, 2021
Anjalie Field, Su Lin Blodgett, Zeerak Waseem, Yulia Tsvetkov

Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL Anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.

* Accepted to ACL 2021 