Hongning Wang

User Welfare Optimization in Recommender Systems with Competing Content Creators

Apr 28, 2024

ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback

Apr 03, 2024

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Mar 08, 2024

Federated Linear Contextual Bandits with Heterogeneous Clients

Feb 29, 2024

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Feb 26, 2024

Stealthy Adversarial Attacks on Stochastic Multi-Armed Bandits

Feb 21, 2024

Incentivized Truthful Communication for Federated Bandits

Feb 07, 2024

Towards Efficient and Exact Optimization of Language Model Alignment

Feb 02, 2024

AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback

Feb 02, 2024

The Impact of Snippet Reliability on Misinformation in Online Health Search

Jan 28, 2024