Davide Mazzaccara

Playpen: An Environment for Exploring Learning Through Conversational Interaction

Apr 11, 2025

Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi(+6 more)

Abstract: Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction-following attempts) and for improving "reasoning" (process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO and GRPO, showing that all of these approaches achieve some improvements on in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.

* Source code: https://github.com/lm-playpen/playpen. Please send correspondence to: lm-playschool@googlegroups.com

Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain

Jun 25, 2024

Davide Mazzaccara, Alberto Testoni, Raffaella Bernardi

Abstract: Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models (LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain (EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We sample multiple questions from the same model (LLAMA 2-CHAT 7B) for each game and create pairs of low-EIG and high-EIG questions to apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of EIG), even in domains different from those used to train the DPO model.
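
As a rough illustration of the pairing idea described in the abstract, the sketch below scores yes/no questions by EIG and builds chosen/rejected pairs. It is not the authors' implementation. It assumes a uniform prior over the candidate items and noise-free answers, in which case the EIG of a yes/no question reduces to the binary entropy of the split it induces. The function names, the `min_gap` threshold, and the toy candidate set are all hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Sequence


def expected_information_gain(
    candidates: Sequence[str], answers_yes: Callable[[str], bool]
) -> float:
    """EIG in bits of a yes/no question.

    Assumption: uniform prior over candidates and deterministic answers,
    so EIG equals the binary entropy of the yes/no split.
    """
    if not candidates:
        raise ValueError("candidate set must not be empty")
    p_yes = sum(1 for c in candidates if answers_yes(c)) / len(candidates)
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is already known, nothing is learned
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))


def build_preference_pairs(
    scored: Sequence[tuple[str, float]], min_gap: float = 0.5
) -> list[dict[str, str]]:
    """Pair higher-EIG questions (chosen) with lower-EIG ones (rejected).

    `min_gap` is a hypothetical threshold that drops near-ties.
    """
    pairs = []
    for (q_a, eig_a), (q_b, eig_b) in combinations(scored, 2):
        if abs(eig_a - eig_b) < min_gap:
            continue
        chosen, rejected = (q_a, q_b) if eig_a > eig_b else (q_b, q_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Toy game state: four candidate items, three sampled questions.
candidates = ["dog", "cat", "eagle", "salmon"]
questions = {
    "Is it a mammal?": lambda c: c in {"dog", "cat"},
    "Is it a dog?": lambda c: c == "dog",
    "Is it an animal?": lambda c: True,
}
scored = [
    (q, expected_information_gain(candidates, f)) for q, f in questions.items()
]
pairs = build_preference_pairs(scored, min_gap=0.5)
```

Under these assumptions, the question that halves the candidate set scores highest and the question with a foregone answer scores zero. The resulting `chosen`/`rejected` records have the shape that preference-optimization trainers typically consume alongside the dialogue prompt.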
