Yicheng Gao

Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge

Jan 20, 2026

Xiaolin Zhou, Zheng Luo, Yicheng Gao, Qixuan Chen, Xiyang Hu, Yue Zhao, Ruishan Liu

Abstract: Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity alone.

FairREAD: Re-fusing Demographic Attributes after Disentanglement for Fair Medical Image Classification

Dec 20, 2024

Yicheng Gao, Jinkui Hao, Bo Zhou

Abstract: Recent advancements in deep learning have shown transformative potential in medical imaging, yet concerns about fairness persist due to performance disparities across demographic subgroups. Existing methods aim to address these biases by mitigating sensitive attributes in image data; however, these attributes often carry clinically relevant information, and their removal can compromise model performance, a highly undesirable outcome. To address this challenge, we propose Fair Re-fusion After Disentanglement (FairREAD), a novel, simple, and efficient framework that mitigates unfairness by re-integrating sensitive demographic attributes into fair image representations. FairREAD employs orthogonality constraints and adversarial training to disentangle demographic information while using a controlled re-fusion mechanism to preserve clinically relevant details. Additionally, subgroup-specific threshold adjustments ensure equitable performance across demographic groups. Comprehensive evaluations on a large-scale clinical X-ray dataset demonstrate that FairREAD significantly reduces unfairness metrics while maintaining diagnostic accuracy, establishing a new benchmark for fairness and performance in medical image classification.

* Submitted to Medical Image Analysis, code will be available after review is complete

Bayesian Calibration of Win Rate Estimation with LLM Evaluators

Nov 07, 2024

Yicheng Gao, Gonghan Xu, Zhe Wang, Arman Cohan

Abstract: Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generations from LLMs. However, applying LLM evaluators naively to compare or judge between different systems can lead to unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. To mitigate this problem, we propose two calibration methods, Bayesian Win Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks. We show that both our methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation.

* Accepted by EMNLP 2024

Nuclear Norm based Matrix Regression with Applications to Face Recognition with Occlusion and Illumination Changes

May 06, 2014

Jian Yang, Jianjun Qian, Lei Luo, Fanlong Zhang, Yicheng Gao

Abstract: Regression analysis has recently become a popular tool for face recognition. The existing regression methods all use a one-dimensional pixel-based error model, which characterizes the representation error pixel by pixel and thus neglects the overall structure of the error image. We observe that occlusion and illumination changes generally lead to a low-rank error image. To make use of this low-rank structural information, this paper presents a two-dimensional image-matrix-based error model, i.e. matrix regression, for face representation and classification. Our model uses the minimal nuclear norm of the representation error image as a criterion, and the alternating direction method of multipliers (ADMM) to calculate the regression coefficients. Compared with current regression methods, the proposed Nuclear Norm based Matrix Regression (NMR) model is more robust in alleviating the effect of illumination, and more intuitive and powerful in removing the structural noise caused by occlusion. We experiment on four popular face image databases: the Extended Yale B database, the AR database, the Multi-PIE database, and the FRGC database. Experimental results demonstrate the performance advantage of NMR over state-of-the-art regression-based face recognition methods.

* 30 pages
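
The abstract's observation that occlusion tends to produce a low-rank error image, and its use of the nuclear norm as a criterion, can be illustrated with a small sketch. This is not the NMR model or its ADMM solver; it only compares the nuclear norm (sum of singular values) of two synthetic error images with equal Frobenius norm. The 32x32 image size, the block position, and the names `occlusion`, `noise`, and `nuclear_norm` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 32, 32  # hypothetical image size, chosen only for this sketch

# Hypothetical occlusion error: a constant rectangular block, which has rank 1.
occlusion = np.zeros((h, w))
occlusion[8:20, 10:26] = 1.0

# Unstructured (dense) error, rescaled to the same Frobenius norm as the block.
noise = rng.standard_normal((h, w))
noise *= np.linalg.norm(occlusion) / np.linalg.norm(noise)


def nuclear_norm(m: np.ndarray) -> float:
    """Nuclear norm: the sum of the singular values of a matrix."""
    return float(np.linalg.svd(m, compute_uv=False).sum())


occ_rank = int(np.linalg.matrix_rank(occlusion))
occ_nuc = nuclear_norm(occlusion)
noise_nuc = nuclear_norm(noise)

print(f"rank of occlusion error: {occ_rank}")
print(f"nuclear norm, occlusion error: {occ_nuc:.2f}")
print(f"nuclear norm, dense noise error: {noise_nuc:.2f}")
```

Both error images carry the same energy pixel by pixel, so a pixel-wise squared-error criterion cannot tell them apart, while the nuclear norm is much smaller for the structured block. That gap is the kind of structural information a nuclear-norm criterion can exploit.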
