Zehao Xu

Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing

Feb 18, 2026

Zehao Xu, Tony Paek, Kevin O'Sullivan, Attila Dobi

Abstract: Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable surrogate-based prevalence measurement framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using score bucketing as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket-level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment-control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.
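The score-bucketing idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`bucket_of`, `calibrate`, `estimate_prevalence`), the fixed bucket edges, the `(score, weight, label)` record layout, and the toy data are all assumptions made for illustration. It shows only the two stages the abstract names: offline calibration of bucket-level prevalences, then a log-only estimate per arm.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_of(score, edges):
    """Index of the bucket that `score` falls into, given sorted bucket edges."""
    return bisect_right(edges, score)


def calibrate(labeled, edges):
    """Offline step: weighted prevalence of the positive label per bucket.

    `labeled` is an iterable of (score, weight, label) tuples with label in
    {0, 1}; this record layout is an assumption, not taken from the paper.
    """
    positive = defaultdict(float)
    total = defaultdict(float)
    for score, weight, label in labeled:
        b = bucket_of(score, edges)
        positive[b] += weight * label
        total[b] += weight
    return {b: positive[b] / total[b] for b in total if total[b] > 0}


def estimate_prevalence(impression_scores, edges, bucket_prev):
    """Per-arm step: mix calibrated bucket prevalences by the arm's bucket shares.

    Raises KeyError if the arm hits a bucket with no calibration labels,
    rather than silently guessing a prevalence for it.
    """
    counts = defaultdict(int)
    for score in impression_scores:
        counts[bucket_of(score, edges)] += 1
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no impressions for this arm")
    return sum(counts[b] / n * bucket_prev[b] for b in counts)


# Toy data (hypothetical): one edge at 0.5 gives two buckets.
edges = [0.5]
labeled = [
    (0.1, 1.0, 0), (0.2, 1.0, 0), (0.3, 1.0, 1), (0.4, 1.0, 0),
    (0.6, 1.0, 1), (0.7, 1.0, 1), (0.8, 1.0, 0), (0.9, 1.0, 1),
]
bucket_prev = calibrate(labeled, edges)

control = estimate_prevalence([0.1, 0.2, 0.6, 0.9], edges, bucket_prev)
treatment = estimate_prevalence([0.1, 0.2, 0.3, 0.7], edges, bucket_prev)
delta = treatment - control
print(control, treatment, delta)
```

In this sketch the labeled sample is consumed once in `calibrate`, and every later arm or segment needs only its logged scores, which mirrors the decoupling the abstract describes. The estimate is only as good as the assumption that prevalence within a bucket is the same across arms.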


Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Apr 28, 2025

Huichi Zhou, Zehao Xu, Munan Zhao, Kaihong Li, Yiqiang Li, Hongtao Wang


Abstract: In this paper, we introduce the Multilingual Moral Reasoning Benchmark (MMRB) to evaluate the moral reasoning abilities of large language models (LLMs) across five typologically diverse languages and three levels of contextual complexity: sentence, paragraph, and document. Our results show that moral reasoning performance degrades with increasing context complexity, particularly for low-resource languages such as Vietnamese. We further fine-tune the open-source LLaMA-3-8B model using curated monolingual data for alignment and poisoning. Surprisingly, low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, highlighting their critical role in multilingual NLP.

* 5 pages, 2 figures
