Adam Visokay

Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

Mar 22, 2026

Navya Mehrotra, Adam Visokay, Kristina Gligorić

Abstract: Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
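The abstract does not spell out the sampling rule, so the following is only an illustrative sketch of the general idea of spending more human annotations where the LLM proxy does worse. The function name `allocate_budget`, the notion of a per-group pilot error estimate, and the proportional allocation rule are all assumptions made for this example and are not taken from the paper.

```python
def allocate_budget(pilot_error, budget):
    """Split a human-annotation budget across groups in proportion to an
    estimated LLM error per group (hypothetical rule, not the paper's method).

    pilot_error: dict mapping group name -> non-negative error estimate
    budget: total number of human annotations available (int)
    """
    total = sum(pilot_error.values())
    if total == 0:
        # No observed error anywhere: fall back to a uniform split.
        shares = {g: budget / len(pilot_error) for g in pilot_error}
    else:
        shares = {g: budget * e / total for g, e in pilot_error.items()}

    # Round down, then hand leftover units to the largest fractional parts
    # so that the allocation always sums exactly to the budget.
    alloc = {g: int(s) for g, s in shares.items()}
    leftover = budget - sum(alloc.values())
    by_fraction = sorted(shares, key=lambda g: shares[g] - alloc[g], reverse=True)
    for g in by_fraction[:leftover]:
        alloc[g] += 1
    return alloc
```

With this rule, a group whose pilot error is three times that of another receives roughly three times as many human annotations. An adaptive scheme would typically re-estimate the errors and re-allocate as labels arrive.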

GPT Deciphering Fedspeak: Quantifying Dissent Among Hawks and Doves

Jul 26, 2024

Denis Peskoff, Adam Visokay, Sander Schulhoff, Benjamin Wachspress, Alan Blinder, Brandon M. Stewart

Abstract: Markets and policymakers around the world hang on the consequential monetary policy decisions made by the Federal Open Market Committee (FOMC). Publicly available textual documentation of their meetings provides insight into members' attitudes about the economy. We use GPT-4 to quantify dissent among members on the topic of inflation. We find that transcripts and minutes reflect the diversity of member views about the macroeconomic outlook in a way that is lost or omitted from the public statements. In fact, diverging opinions that shed light upon the committee's "true" attitudes are almost entirely omitted from the final statements. Hence, we argue that forecasting FOMC sentiment based solely on statements will not sufficiently reflect dissent among the hawks and doves.

From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Apr 03, 2024

Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick

Abstract: In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting the likely COD using the VA interview and (ii) performing inference with the predicted CODs (e.g., modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in "prediction-powered inference" to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that, if inference tasks are the end goal, having a small amount of contextually relevant, high-quality labeled data is essential regardless of the NLP algorithm.

* 12 pages, 7 figures
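
To make the idea of correcting inference on predicted labels concrete, here is a minimal sketch of the basic prediction-powered estimate of class proportions: the proportions predicted on a large unlabeled set are shifted by the average prediction error measured on a small labeled set. This is not multiPPI++ itself, which targets multinomial regression and is not described in enough detail in the abstract to reproduce. The function name `ppi_class_proportions` and its arguments are hypothetical.

```python
import numpy as np

def ppi_class_proportions(pred_unlabeled, pred_labeled, true_labeled, n_classes):
    """Basic prediction-powered estimate of class proportions
    (illustrative sketch, not the multiPPI++ method).

    pred_unlabeled: predicted class ids on the large unlabeled sample
    pred_labeled:   predicted class ids on the small labeled sample
    true_labeled:   ground-truth class ids on the same labeled sample
    """
    def proportions(labels):
        # Fraction of observations falling in each class.
        return np.bincount(labels, minlength=n_classes) / len(labels)

    naive = proportions(pred_unlabeled)
    # Rectifier: how far predictions are from the truth, on average,
    # measured where ground truth is available.
    rectifier = proportions(true_labeled) - proportions(pred_labeled)
    return naive + rectifier
```

The corrected estimate always sums to one, because the rectifier sums to zero, but individual entries can fall slightly outside [0, 1] in small samples. The sketch also omits confidence intervals, which are what make the inference "valid" in the sense the abstract describes.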
