Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ziwei Zhu

ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Oct 31, 2025

Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun(+1 more)

Abstract:Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100\% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.

Via

Access Paper or Ask Questions

Talent or Luck? Evaluating Attribution Bias in Large Language Models

May 28, 2025

Chahat Raj, Mahika Banerjee, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

Figure 1 for Talent or Luck? Evaluating Attribution Bias in Large Language Models

Figure 2 for Talent or Luck? Evaluating Attribution Bias in Large Language Models

Figure 3 for Talent or Luck? Evaluating Attribution Bias in Large Language Models

Figure 4 for Talent or Luck? Evaluating Attribution Bias in Large Language Models

Abstract:When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.

* 18 pages

Via

Access Paper or Ask Questions

VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

May 28, 2025

Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

Figure 1 for VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Figure 2 for VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Figure 3 for VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Figure 4 for VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Abstract:While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

* 17 pages

Via

Access Paper or Ask Questions

Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

May 27, 2025

Mehrdad Fazli, Bowen Wei, Ziwei Zhu

Figure 1 for Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Figure 2 for Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Figure 3 for Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Figure 4 for Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Abstract:Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, and confidently describe objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.

Via

Access Paper or Ask Questions

Learning to Explain: Prototype-Based Surrogate Models for LLM Classification

May 25, 2025

Bowen Wei, Ziwei Zhu

Abstract:Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model's reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose \textbf{ProtoSurE}, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.

Via

Access Paper or Ask Questions

Measuring South Asian Biases in Large Language Models

May 24, 2025

Mamnuya Rinki, Chahat Raj, Anjishnu Mukherjee, Ziwei Zhu

Abstract:Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.

Via

Access Paper or Ask Questions

Federated Conversational Recommender System

Mar 02, 2025

Allen Lin, Jianling Wang, Ziwei Zhu, James Caverlee

Figure 1 for Federated Conversational Recommender System

Figure 2 for Federated Conversational Recommender System

Figure 3 for Federated Conversational Recommender System

Figure 4 for Federated Conversational Recommender System

Abstract:Conversational Recommender Systems (CRSs) have become increasingly popular as a powerful tool for providing personalized recommendation experiences. By directly engaging with users in a conversational manner to learn their current and fine-grained preferences, a CRS can quickly derive recommendations that are relevant and justifiable. However, existing conversational recommendation systems (CRSs) typically rely on a centralized training and deployment process, which involves collecting and storing explicitly-communicated user preferences in a centralized repository. These fine-grained user preferences are completely human-interpretable and can easily be used to infer sensitive information (e.g., financial status, political stands, and health information) about the user, if leaked or breached. To address the user privacy concerns in CRS, we first define a set of privacy protection guidelines for preserving user privacy under the conversational recommendation setting. Based on these guidelines, we propose a novel federated conversational recommendation framework that effectively reduces the risk of exposing user privacy by (i) de-centralizing both the historical interests estimation stage and the interactive preference elicitation stage and (ii) strictly bounding privacy leakage by enforcing user-level differential privacy with meticulously selected privacy budgets. Through extensive experiments, we show that the proposed framework not only satisfies these user privacy protection guidelines, but also enables the system to achieve competitive recommendation performance even when compared to the state-of-the-art non-private conversational recommendation approach.

* ECIR 2024

Via

Access Paper or Ask Questions

Beneath the Surface: How Large Language Models Reflect Hidden Bias

Feb 27, 2025

Jinhao Pan, Chahat Raj, Ziyu Yao, Ziwei Zhu

Figure 1 for Beneath the Surface: How Large Language Models Reflect Hidden Bias

Figure 2 for Beneath the Surface: How Large Language Models Reflect Hidden Bias

Figure 3 for Beneath the Surface: How Large Language Models Reflect Hidden Bias

Figure 4 for Beneath the Surface: How Large Language Models Reflect Hidden Bias

Abstract:The exceptional performance of Large Language Models (LLMs) often comes with the unintended propagation of social biases embedded in their training data. While existing benchmarks evaluate overt bias through direct term associations between bias concept terms and demographic terms, LLMs have become increasingly adept at avoiding biased responses, creating an illusion of neutrality. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response to overt bias, they continue to reinforce biases in nuanced settings. Data, code, and results are available at https://github.com/JP-25/Hidden-Bias-Benchmark.

Via

Access Paper or Ask Questions

Generative Semantic Communication: Architectures, Technologies, and Applications

Dec 11, 2024

Jinke Ren, Yaping Sun, Hongyang Du, Weiwen Yuan, Chongjie Wang, Xianda Wang, Yingbin Zhou, Ziwei Zhu, Fangxin Wang, Shuguang Cui

Figure 1 for Generative Semantic Communication: Architectures, Technologies, and Applications

Figure 2 for Generative Semantic Communication: Architectures, Technologies, and Applications

Figure 3 for Generative Semantic Communication: Architectures, Technologies, and Applications

Figure 4 for Generative Semantic Communication: Architectures, Technologies, and Applications

Abstract:This paper delves into the applications of generative artificial intelligence (GAI) in semantic communication (SemCom) and presents a thorough study. Three popular SemCom systems enabled by classical GAI models are first introduced, including variational autoencoders, generative adversarial networks, and diffusion models. For each system, the fundamental concept of the GAI model, the corresponding SemCom architecture, and the associated literature review of recent efforts are elucidated. Then, a novel generative SemCom system is proposed by incorporating the cutting-edge GAI technology-large language models (LLMs). This system features two LLM-based AI agents at both the transmitter and receiver, serving as "brains" to enable powerful information understanding and content regeneration capabilities, respectively. This innovative design allows the receiver to directly generate the desired content, instead of recovering the bit stream, based on the coded semantic information conveyed by the transmitter. Therefore, it shifts the communication mindset from "information recovery" to "information regeneration" and thus ushers in a new era of generative SemCom. A case study on point-to-point video retrieval is presented to demonstrate the superiority of the proposed generative SemCom system, showcasing a 99.98% reduction in communication overhead and a 53% improvement in retrieval accuracy compared to the traditional communication system. Furthermore, four typical application scenarios for generative SemCom are delineated, followed by a discussion of three open issues warranting future investigation. In a nutshell, this paper provides a holistic set of guidelines for applying GAI in SemCom, paving the way for the efficient implementation of generative SemCom in future wireless networks.

* 18 pages, 8 figures

Via

Access Paper or Ask Questions

Cross-Domain Recommendation Meets Large Language Models

Nov 29, 2024

Ajay Krishna Vajjala, Dipak Meher, Ziwei Zhu, David S. Rosenblum

Figure 1 for Cross-Domain Recommendation Meets Large Language Models

Figure 2 for Cross-Domain Recommendation Meets Large Language Models

Figure 3 for Cross-Domain Recommendation Meets Large Language Models

Figure 4 for Cross-Domain Recommendation Meets Large Language Models

Abstract:Cross-domain recommendation (CDR) has emerged as a promising solution to the cold-start problem, faced by single-domain recommender systems. However, existing CDR models rely on complex neural architectures, large datasets, and significant computational resources, making them less effective in data-scarce scenarios or when simplicity is crucial. In this work, we leverage the reasoning capabilities of large language models (LLMs) and explore their performance in the CDR domain across multiple domain pairs. We introduce two novel prompt designs tailored for CDR and demonstrate that LLMs, when prompted effectively, outperform state-of-the-art CDR baselines across various metrics and domain combinations in the rating prediction and ranking tasks. This work bridges the gap between LLMs and recommendation systems, showcasing their potential as effective cross-domain recommenders.

* 12 pages

Via

Access Paper or Ask Questions