Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Benjamin Van Roy

Stanford University Department of Electrical Engineering

Efficient Exploration for LLMs

Feb 01, 2024

Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

Figure 1 for Efficient Exploration for LLMs

Figure 2 for Efficient Exploration for LLMs

Figure 3 for Efficient Exploration for LLMs

Figure 4 for Efficient Exploration for LLMs

Abstract:We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.

Via

Access Paper or Ask Questions

An Information-Theoretic Analysis of In-Context Learning

Jan 28, 2024

Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy

Abstract:Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted. We introduce new information-theoretic tools that lead to an elegant and very general decomposition of error into three components: irreducible error, meta-learning error, and intra-task error. These tools unify analyses across many meta-learning challenges. To illustrate, we apply them to establish new results about in-context learning with transformers. Our theoretical results characterizes how error decays in both the number of training sequences and sequence lengths. Our results are very general; for example, they avoid contrived mixing time assumptions made by all prior results that establish decay of error with sequence length.

Via

Access Paper or Ask Questions

RLHF and IIA: Perverse Incentives

Dec 02, 2023

Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

Figure 1 for RLHF and IIA: Perverse Incentives

Figure 2 for RLHF and IIA: Perverse Incentives

Figure 3 for RLHF and IIA: Perverse Incentives

Figure 4 for RLHF and IIA: Perverse Incentives

Abstract:Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA give rise to egregious behavior when innovating on query formats or learning algorithms.

Via

Access Paper or Ask Questions

Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

Oct 14, 2023

Zheqing Zhu, Yueyang Liu, Xu Kuang, Benjamin Van Roy

Figure 1 for Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

Figure 2 for Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

Figure 3 for Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

Figure 4 for Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

Abstract:Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. While a number of non-stationary contextual bandit learning algorithms have been proposed in the literature, they excessively explore due to a lack of prioritization for information of enduring value, or are designed in ways that do not scale in modern applications with high-dimensional user-specific features and large action set, or both. In this paper, we introduce a novel non-stationary contextual bandit algorithm that addresses these concerns. It combines a scalable, deep-neural-network-based architecture with a carefully designed exploration mechanism that strategically prioritizes collecting information with the most lasting value in a non-stationary environment. Through empirical evaluations on two real-world recommendation datasets, which exhibit pronounced non-stationarity, we demonstrate that our approach significantly outperforms the state-of-the-art baselines.

Via

Access Paper or Ask Questions

Maintaining Plasticity via Regenerative Regularization

Aug 23, 2023

Saurabh Kumar, Henrik Marklund, Benjamin Van Roy

Abstract:In continual learning, plasticity refers to the ability of an agent to quickly adapt to new information. Neural networks are known to lose plasticity when processing non-stationary data streams. In this paper, we propose L2 Init, a very simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters. This is very similar to standard L2 regularization (L2), the only difference being that L2 regularizes toward the origin. L2 Init is simple to implement and requires selecting only a single hyper-parameter. The motivation for this method is the same as that of methods that reset neurons or parameter values. Intuitively, when recent losses are insensitive to particular parameters, these parameters drift toward their initial values. This prepares parameters to adapt quickly to new tasks. On simple problems representative of different types of nonstationarity in continual learning, we demonstrate that L2 Init consistently mitigates plasticity loss. We additionally find that our regularization term reduces parameter magnitudes and maintains a high effective feature rank.

Via

Access Paper or Ask Questions

A Definition of Continual Reinforcement Learning

Jul 20, 2023

David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh

Abstract:In this paper we develop a foundation for continual reinforcement learning.

Via

Access Paper or Ask Questions

On the Convergence of Bounded Agents

Jul 20, 2023

David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh

Abstract:When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly less clear. In this paper, we propose two complementary accounts of agent convergence in a framing of the reinforcement learning problem that centers around bounded agents. The first view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease. The second view says that a bounded agent has converged just when the agent's performance only changes if the agent's internal state changes. We establish basic properties of these two definitions, show that they accommodate typical views of convergence in standard settings, and prove several facts about their nature and relationship. We take these perspectives, definitions, and analysis to bring clarity to a central idea of the field.

Via

Access Paper or Ask Questions

Continual Learning as Computationally Constrained Reinforcement Learning

Jul 10, 2023

Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, Benjamin Van Roy

Figure 1 for Continual Learning as Computationally Constrained Reinforcement Learning

Figure 2 for Continual Learning as Computationally Constrained Reinforcement Learning

Figure 3 for Continual Learning as Computationally Constrained Reinforcement Learning

Figure 4 for Continual Learning as Computationally Constrained Reinforcement Learning

Abstract:An agent that efficiently accumulates knowledge to develop increasingly sophisticated skills over a long lifetime could advance the frontier of artificial intelligence capabilities. The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning. This monograph clarifies and formalizes concepts of continual learning, introducing a framework and set of tools to stimulate further research.

Via

Access Paper or Ask Questions

Scalable Neural Contextual Bandit for Recommender Systems

Jun 26, 2023

Zheqing Zhu, Benjamin Van Roy

Abstract:High-quality recommender systems ought to deliver both innovative and relevant content through effective and exploratory interactions with users. Yet, supervised learning-based neural networks, which form the backbone of many existing recommender systems, only leverage recognized user interests, falling short when it comes to efficiently uncovering unknown user preferences. While there has been some progress with neural contextual bandit algorithms towards enabling online exploration through neural networks, their onerous computational demands hinder widespread adoption in real-world recommender systems. In this work, we propose a scalable sample-efficient neural contextual bandit algorithm for recommender systems. To do this, we design an epistemic neural network architecture, Epistemic Neural Recommendation (ENR), that enables Thompson sampling at a large scale. In two distinct large-scale experiments with real-world tasks, ENR significantly boosts click-through rates and user ratings by at least 9% and 6% respectively compared to state-of-the-art neural contextual bandit algorithms. Furthermore, it achieves equivalent performance with at least 29% fewer user interactions compared to the best-performing baseline algorithm. Remarkably, while accomplishing these improvements, ENR demands orders of magnitude fewer computational resources than neural contextual bandit baseline algorithms.

Via

Access Paper or Ask Questions

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

May 19, 2023

Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy

Figure 1 for Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Figure 2 for Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Figure 3 for Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Figure 4 for Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Abstract:A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.

Via

Access Paper or Ask Questions