Ida Momennejad

ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning

Sep 27, 2023
Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad

From grading papers to summarizing medical documents, large language models (LLMs) are increasingly used to evaluate text generated by humans and AI alike. However, despite their extensive utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities. Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors. ALLURE compares LLM-generated evaluations with annotated data and iteratively incorporates instances of significant deviation into the evaluator, which leverages in-context learning (ICL) to make LLM-based text evaluation more robust. Through this iterative process, we refine the performance of the evaluator LLM, ultimately reducing reliance on human annotators in the evaluation process. We anticipate that ALLURE will serve diverse applications of LLMs in domains that involve evaluating textual data, such as medical summarization, education, and productivity.
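
The audit loop at the heart of this idea can be sketched in a few lines. The following is a minimal illustration of the iterative in-context-learning process described above, not the authors' implementation; `llm_evaluate`, the numeric labels, and the deviation threshold are hypothetical stand-ins.

```python
# Minimal sketch of an ALLURE-style audit loop (hypothetical names, not
# the authors' code). An evaluator LLM grades texts, its outputs are
# compared against human annotations, and the largest disagreements are
# folded back into the evaluator's prompt as in-context examples.

def audit_and_improve(llm_evaluate, annotated_data, n_rounds=5, threshold=0.5):
    icl_examples = []  # accumulated examples of past evaluator failures
    for _ in range(n_rounds):
        deviations = []
        for text, human_label in annotated_data:
            llm_label = llm_evaluate(text, icl_examples)
            error = abs(llm_label - human_label)  # assumes numeric labels
            if error > threshold:
                deviations.append((error, text, human_label))
        if not deviations:
            break  # evaluator now agrees with the annotators everywhere
        # Fold the largest-deviation instances back in as ICL examples.
        deviations.sort(key=lambda d: d[0], reverse=True)
        icl_examples.extend((t, h) for _, t, h in deviations[:3])
    return icl_examples
```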

Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

Sep 25, 2023
Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson

Recently, an influx of studies has claimed emergent cognitive abilities in large language models (LLMs). Yet most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive-science-inspired protocol for the systematic evaluation of cognitive capacities in large language models. The CogEval protocol can be followed to evaluate a variety of abilities. Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer established construct validity for evaluating planning and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.
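
The skeleton of such a protocol is easy to picture in code. Below is an illustrative sketch only, with a hypothetical `run_task` callable and one possible choice of statistical test; the actual CogEval protocol, prompts, and statistics are specified in the paper.

```python
# Sketch of a CogEval-style evaluation skeleton (illustrative only):
# each task is run for many independent repetitions under experimental
# and control conditions, and performance is compared with a statistical
# robustness test rather than a single anecdotal prompt.

from statistics import mean
from scipy.stats import mannwhitneyu  # one possible nonparametric test

def evaluate_model(run_task, tasks, n_reps=30):
    results = {}
    for task in tasks:
        scores = [run_task(task, condition="experimental") for _ in range(n_reps)]
        controls = [run_task(task, condition="control") for _ in range(n_reps)]
        stat, p = mannwhitneyu(scores, controls)
        results[task] = {"mean": mean(scores),
                         "control_mean": mean(controls),
                         "p_value": p}
    return results
```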

Replay Buffer With Local Forgetting for Adaptive Deep Model-Based Reinforcement Learning

Mar 15, 2023
Ali Rahimi-Kalahroudi, Janarthanan Rajendran, Ida Momennejad, Harm van Seijen, Sarath Chandar

One of the key behavioral characteristics used in neuroscience to determine whether the subject of study -- be it a rodent or a human -- exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind, while the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation because it retains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state-space while also adapting effectively to changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods can likewise adapt effectively to local changes in the environment.
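
The buffer variation itself is conceptually simple. Here is a minimal sketch of the local-forgetting idea under stated assumptions (continuous state vectors, a Euclidean neighbourhood, and a hypothetical radius hyperparameter); it is not the authors' implementation.

```python
import numpy as np

# Sketch of a replay buffer with local forgetting: before storing a new
# transition, evict stored transitions whose states lie in the local
# neighbourhood of the new state, so stale data near a changed region
# cannot contradict fresh observations. `radius` is a hypothetical knob.

class LocalForgettingBuffer:
    def __init__(self, capacity, radius=0.1):
        self.capacity, self.radius = capacity, radius
        self.buffer = []  # list of (state, action, reward, next_state)

    def add(self, state, action, reward, next_state):
        # Local forgetting: drop samples close to the newly observed state.
        self.buffer = [t for t in self.buffer
                       if np.linalg.norm(t[0] - state) > self.radius]
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # fall back to FIFO if still over capacity
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        idx = np.random.randint(len(self.buffer), size=batch_size)
        return [self.buffer[i] for i in idx]
```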

Navigates Like Me: Understanding How People Evaluate Human-Like AI in Video Games

Mar 02, 2023
Stephanie Milani, Arthur Juliani, Ida Momennejad, Raluca Georgescu, Jaroslaw Rzepecki, Alison Shaw, Gavin Costello, Fei Fang, Sam Devlin, Katja Hofmann

We aim to understand how people assess human likeness in navigation produced by people and artificially intelligent (AI) agents in a video game. To this end, we propose a novel AI agent with the goal of generating more human-like behavior. We collect hundreds of crowd-sourced assessments comparing the human-likeness of navigation behavior generated by our agent and baseline AI agents with human-generated behavior. Our proposed agent passes a Turing Test, while the baseline agents do not. By passing a Turing Test, we mean that human judges could not quantitatively distinguish between videos of a person and an AI agent navigating. To understand what people believe constitutes human-like navigation, we extensively analyze the justifications of these assessments. This work provides insights into the characteristics that people consider human-like in the context of goal-directed video game navigation, which is a key step for further improving human interactions with AI agents.
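
To make the Turing-test claim concrete: if judges pick the human at roughly chance level, a simple binomial test cannot reject 50/50 guessing. The numbers below are hypothetical, and this is only one way such a check could be run, not necessarily the paper's exact analysis.

```python
# Hypothetical illustration of a quantitative Turing-test check.
from scipy.stats import binomtest

n_judgments = 300  # hypothetical number of pairwise assessments
n_correct = 157    # hypothetical count of judges identifying the human
result = binomtest(n_correct, n_judgments, p=0.5)
print(f"p-value = {result.pvalue:.3f}")  # large p: indistinguishable from chance
```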

* 18 pages; accepted at CHI 2023 
Imitating Human Behaviour with Diffusion Models

Jan 25, 2023
Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, Sam Devlin

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
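
At inference time, such a policy samples an action by iteratively denoising random noise conditioned on the current observation. The sketch below shows a bare DDPM-style sampling loop with a hypothetical trained `denoiser` network; the paper's architectures, guidance, and sampling strategies are more elaborate.

```python
import numpy as np

# Minimal sketch of an observation-to-action diffusion policy at
# inference time (illustrative; `denoiser` is a hypothetical trained
# network predicting the noise in a noisy action, given the observation
# and the diffusion step).

def sample_action(denoiser, obs, action_dim, n_steps=50):
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = np.random.randn(action_dim)  # start from pure noise
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(a, obs, t)  # predicted injected noise
        # Standard DDPM mean update; the learned distribution can stay
        # multimodal, matching stochastic human behaviour.
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * np.random.randn(action_dim)
    return a
```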

* Published in ICLR 2023 
A Rubric for Human-like Agents and NeuroAI

Dec 08, 2022
Ida Momennejad

Researchers across cognitive, neuro-, and computer sciences increasingly reference human-like artificial intelligence and neuroAI. However, the scope and use of these terms are often inconsistent. Contributed research ranges widely, from mimicking behaviour to testing machine learning methods as neurally plausible hypotheses at the cellular or functional level to solving engineering problems. Yet progress on one of these three goals cannot be assumed or expected to translate automatically into progress on the others. Here a simple rubric is proposed to clarify the scope of individual contributions, grounded in their commitments to human-like behaviour, neural plausibility, or benchmark/engineering goals. This is clarified using examples of weak and strong neuroAI and human-like agents, and by discussing the generative, corroborative, and corrective ways in which the three dimensions interact with one another. The author maintains that future progress in artificial intelligence will need strong interactions across the disciplines, with iterative feedback loops and meticulous validity tests, leading to both known and yet-unknown advances that may span decades to come.
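
As a purely illustrative aid, the rubric can be read as three independent dimensions on which any contribution takes a value of none, weak, or strong; the encoding below is an assumption for illustration, not part of the paper.

```python
from dataclasses import dataclass

# Illustrative encoding of the proposed rubric (the three dimensions are
# the author's; the numeric scale is an assumption): a contribution is
# scored by its commitments along each dimension.

@dataclass
class NeuroAIRubric:
    human_like_behaviour: int    # 0 = none, 1 = weak, 2 = strong
    neural_plausibility: int
    engineering_benchmarks: int

    def scope(self):
        labels = ("none", "weak", "strong")
        return {
            "human-like behaviour": labels[self.human_like_behaviour],
            "neural plausibility": labels[self.neural_plausibility],
            "benchmark/engineering": labels[self.engineering_benchmarks],
        }
```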

Eigen Memory Trees

Oct 31, 2022
Mark Rucker, Jordan T. Ash, John Langford, Paul Mineiro, Ida Momennejad

This work introduces the Eigen Memory Tree (EMT), a novel online memory model for sequential learning scenarios. EMTs store data at the leaves of a binary tree and route new samples through the structure using the principal components of previous experiences, facilitating efficient (logarithmic) access to relevant memories. We demonstrate that EMT outperforms existing online memory approaches, and provide a hybridized EMT-parametric algorithm that enjoys drastically improved performance over purely parametric methods with nearly no downsides. Our findings are validated using 206 datasets from the OpenML repository in both bounded and infinite memory budget situations.
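
The routing idea can be sketched compactly. The following is an illustrative reconstruction under stated assumptions (a dict-based tree and median splits on the top principal component), not the authors' code.

```python
import numpy as np

# Sketch of EMT-style storage and routing: internal nodes split memories
# by their projection onto the top principal component of the data that
# reached them; a query descends to a leaf in O(log n) steps and matches
# only against the memories stored there.

def top_principal_component(X):
    X = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(X.T))  # eigenvectors, ascending order
    return vecs[:, -1]                     # direction of largest variance

def build(X, memories, leaf_size=8):
    if len(X) <= leaf_size:
        return {"memories": memories}
    d = top_principal_component(X)
    proj = X @ d
    thr = np.median(proj)
    mask = proj > thr
    if mask.all() or not mask.any():  # degenerate split: stop here
        return {"memories": memories}
    return {"direction": d, "split": thr,
            "left": build(X[~mask], [m for m, r in zip(memories, mask) if not r]),
            "right": build(X[mask], [m for m, r in zip(memories, mask) if r])}

def route(query, node):
    while "split" in node:  # internal node: follow the learned direction
        node = node["right"] if query @ node["direction"] > node["split"] else node["left"]
    return node["memories"]  # leaf: the relevant stored experiences
```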

* corrected an author name; corrected title plurality 
Interaction-Grounded Learning with Action-inclusive Feedback

Jun 16, 2022
Tengyang Xie, Akanksha Saran, Dylan J. Foster, Lekan Molu, Ida Momennejad, Nan Jiang, Paul Mineiro, John Langford

Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as brain-computer interface (BCI) or human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allow IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
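
The interaction loop of the IGL setting can be written down directly. The sketch below shows only the setting, with hypothetical callables; it does not show the paper's algorithm or analysis for decoding rewards from action-inclusive feedback.

```python
# Sketch of one Interaction-Grounded Learning step (setting only; all
# callables are hypothetical stand-ins). The learner never observes a
# reward: it must infer a latent reward from the feedback vector, which
# here may encode the action itself.

def igl_step(env, policy, reward_decoder, update):
    context = env.observe()                   # context vector x
    action = policy(context)                  # chosen action a
    feedback = env.feedback(context, action)  # feedback vector y, no reward
    r_hat = reward_decoder(feedback, action)  # inferred latent reward
    update(policy, context, action, r_hat)    # e.g., contextual-bandit update
```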

Social Network Structure Shapes Innovation: Experience-sharing in RL with SAPIENS

Jun 10, 2022
Eleni Nisioti, Mateo Mahaut, Pierre-Yves Oudeyer, Ida Momennejad, Clément Moulin-Frier

The human cultural repertoire relies on innovation: our ability to continuously and hierarchically explore how existing elements can be combined to create new ones. Innovation is not solitary; it relies on collective accumulation and merging of previous solutions. Machine learning approaches commonly assume that fully connected multi-agent networks are best suited for innovation. However, human laboratory and field studies have shown that hierarchical innovation is more robustly achieved by dynamic communication topologies. In dynamic topologies, humans oscillate between innovating individually or in small clusters and sharing outcomes with others. To our knowledge, the role of multi-agent topology in innovation has not been systematically studied in machine learning. It remains unclear a) which communication topologies are optimal for which innovation tasks, and b) which properties of experience sharing improve multi-level innovation. Here we use a multi-level hierarchical problem setting (WordCraft) with three different innovation tasks. We systematically design networks of DQNs sharing experiences from their replay buffers in varying topologies (fully connected, small world, dynamic, ring). Comparing the level of innovation achieved by different experience-sharing topologies across different tasks shows that, first, consistent with human findings, experience sharing within a dynamic topology achieves the highest level of innovation across tasks. Second, experience sharing is not as helpful when there is a single clear path to innovation. Third, two metrics we propose, conformity and diversity of shared experience, can explain the success of different topologies on different tasks. These contributions can advance our understanding of optimal AI-AI, human-human, and human-AI collaborative networks, inspiring future tools for fostering collective innovation in large organizations.
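
Mechanically, experience sharing amounts to copying replay-buffer transitions along the edges of a (possibly time-varying) communication graph, as in the hedged sketch below; agent internals and names are hypothetical stand-ins, not the SAPIENS codebase.

```python
import random

# Sketch of experience sharing across a communication topology
# (illustrative; agent attributes are hypothetical). Each agent trains
# its own DQN on its own buffer and periodically pushes a batch of
# transitions to its current neighbours. In a dynamic topology the edge
# set changes over time, alternating isolation with sharing phases.

def share_experiences(agents, edges, batch_size=64):
    for sender, receiver in edges:  # directed pairs from the topology
        if len(agents[sender].replay_buffer) >= batch_size:
            batch = random.sample(agents[sender].replay_buffer, batch_size)
            agents[receiver].replay_buffer.extend(batch)

def train(agents, topology_at, n_steps, share_every=100):
    for step in range(n_steps):
        for agent in agents:
            agent.act_and_learn()  # ordinary DQN step on the agent's buffer
        if step % share_every == 0:
            share_experiences(agents, topology_at(step))  # dynamic edges
```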

Neuro-Nav: A Library for Neurally-Plausible Reinforcement Learning

Jun 06, 2022
Arthur Juliani, Samuel Barnett, Brandon Davis, Margaret Sereno, Ida Momennejad

In this work we propose Neuro-Nav, an open-source library for neurally plausible reinforcement learning (RL). RL is among the most common modeling frameworks for studying decision making, learning, and navigation in biological organisms. In utilizing RL, cognitive scientists often handcraft environments and agents to meet the needs of their particular studies. On the other hand, artificial intelligence researchers often struggle to find benchmarks for neurally and biologically plausible representation and behavior (e.g., in decision making or navigation). In order to streamline this process across both fields with transparency and reproducibility, Neuro-Nav offers a set of standardized environments and RL algorithms drawn from canonical behavioral and neural studies in rodents and humans. We demonstrate that the toolkit replicates relevant findings from a number of studies across both cognitive science and RL literatures. We furthermore describe ways in which the library can be extended with novel algorithms (including deep RL) and environments to address future research needs of the field.
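
The kind of standardized loop such a library enables might look like the generic sketch below; the names here are hypothetical stand-ins and deliberately not Neuro-Nav's actual API.

```python
# Generic benchmark loop (hypothetical names, NOT Neuro-Nav's API):
# pair a canonical rodent/human task with a neurally plausible agent and
# collect behaviour to compare against published findings.

def run_benchmark(make_env, make_agent, n_episodes=100):
    env, agent = make_env(), make_agent()
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.step(obs)
            obs, reward, done = env.transition(action)
            agent.learn(reward)
            total += reward
        returns.append(total)
    return returns  # compare against the published behavioural curve
```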
