Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Christoph Riedl

Artificial collectives of specialists and generalists excel at different tasks

Jun 18, 2026

John Meluso, Laurent Hébert-Dufresne, Christoph Riedl, H. Oliver Gao

Abstract:Collective artificial intelligence, where multiple agents work on shared tasks, holds potential to solve expansive problems in fields from medicine to collective governance. But while prescriptive engineering solutions abound, we lack descriptive scientific understanding of artificial collectives, and therefore principles for how to design resource efficient multi-agent systems. Through systematic experiments with optimizing agents, we characterize how agent interpretive abilities, rationality bounds, and task qualities interact to shape collective performance. Agents range from specialists, with narrow interpretive abilities, to generalists, with broad ones. Collectives of specialists correspond to sparse, centralized networks, while collectives of generalists correspond to dense, decentralized ones. We show that interpretive network properties have small performance effects on average (0.07 standard deviations of performance). However, for specific task qualities, these effects are 4.5 times larger (0.33 sd) and can reach much higher for certain task qualities (1.84 sd). This leads collectives of generalists to perform better on tasks that involve generating, choosing, and coordinating, while collectives of specialists with a few generalist mediators perform better on tasks that involve negotiating. Rationality bounds then moderate these relationships. At loose bounds, specialists outperform generalists through more effective sampling of high-dimensional decision spaces. At tight bounds, generalists outperform specialists through better gradient estimation. A fundamental trade-off between performance and convergence speed emerges at moderate bounds. These findings suggest that multi-agent design could benefit from matching interpretive networks to both task demands and agents' computational limits, with implications for the efficiency and energy costs of multi-agent systems.

* 10 pages, 4 figures

Via

Access Paper or Ask Questions

Agents of Chaos

Feb 23, 2026

Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash(+28 more)

Abstract:We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.

Via

Access Paper or Ask Questions

Language Models use Lookbacks to Track Beliefs

May 20, 2025

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger

Abstract:How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into the LM's belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.

* 32 pages, 32 figures. Code and data at https://belief.baulab.info/

Via

Access Paper or Ask Questions

Effects of AI Feedback on Learning, the Skill Gap, and Intellectual Diversity

Sep 27, 2024

Christoph Riedl, Eric Bogert

Abstract:Can human decision-makers learn from AI feedback? Using data on 52,000 decision-makers from a large online chess platform, we investigate how their AI use affects three interrelated long-term outcomes: Learning, skill gap, and diversity of decision strategies. First, we show that individuals are far more likely to seek AI feedback in situations in which they experienced success rather than failure. This AI feedback seeking strategy turns out to be detrimental to learning: Feedback on successes decreases future performance, while feedback on failures increases it. Second, higher-skilled decision-makers seek AI feedback more often and are far more likely to seek AI feedback after a failure, and benefit more from AI feedback than lower-skilled individuals. As a result, access to AI feedback increases, rather than decreases, the skill gap between high- and low-skilled individuals. Finally, we leverage 42 major platform updates as natural experiments to show that access to AI feedback causes a decrease in intellectual diversity of the population as individuals tend to specialize in the same areas. Together, those results indicate that learning from AI feedback is not automatic and using AI correctly seems to be a skill itself. Furthermore, despite its individual-level benefits, access to AI feedback can have significant population-level downsides including loss of intellectual diversity and an increasing skill gap.

Via

Access Paper or Ask Questions

Large Language Models for Automatic Milestone Detection in Group Discussions

Jun 16, 2024

Zhuoxu Duan, Zhengye Yang, Samuel Westby, Christoph Riedl, Brooke Foucault Welles, Richard J. Radke

Figure 1 for Large Language Models for Automatic Milestone Detection in Group Discussions

Figure 2 for Large Language Models for Automatic Milestone Detection in Group Discussions

Figure 3 for Large Language Models for Automatic Milestone Detection in Group Discussions

Figure 4 for Large Language Models for Automatic Milestone Detection in Group Discussions

Abstract:Large language models like GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcription chunks outperforms semantic similarity search methods using text embeddings, and further discuss the quality and randomness of GPT responses under different context window sizes.

Via

Access Paper or Ask Questions

Fast Model-Selection through Adapting Design of Experiments Maximizing Information Gain

Oct 23, 2018

Stefano Balietti, Brennan Klein, Christoph Riedl

Figure 1 for Fast Model-Selection through Adapting Design of Experiments Maximizing Information Gain

Figure 2 for Fast Model-Selection through Adapting Design of Experiments Maximizing Information Gain

Figure 3 for Fast Model-Selection through Adapting Design of Experiments Maximizing Information Gain

Figure 4 for Fast Model-Selection through Adapting Design of Experiments Maximizing Information Gain

Abstract:To perform model-selection efficiently, we must run informative experiments. Here, we extend a seminal method for designing Bayesian optimal experiments that maximize the information gained from data collected. We introduce two computational improvements that make the procedure tractable: a search algorithm from artificial intelligence and a sampling procedure shrinking the space of possible experiments to evaluate. We collected data for five different experimental designs of a simple imperfect information game and show that experiments optimized for information gain make model-selection possible (and cheaper). We compare the ability of the optimal experimental design to discriminate among competing models against the experimental designs chosen by a "wisdom of experts" prediction experiment. We find that a simple reinforcement learning model best explains human decision-making and that subject behavior is not adequately described by Bayesian Nash equilibrium. Our procedure is general and can be applied iteratively to lab, field and online experiments.

Via

Access Paper or Ask Questions

Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms

Nov 11, 2014

Christoph Riedl, Richard Zanibbi, Marti A. Hearst, Siyu Zhu, Michael Menietti, Jason Crusan, Ivan Metelsky, Karim R. Lakhani

Figure 1 for Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms

Figure 2 for Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms

Figure 3 for Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms

Figure 4 for Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms

Abstract:We report the findings of a month-long online competition in which participants developed algorithms for augmenting the digital version of patent documents published by the United States Patent and Trademark Office (USPTO). The goal was to detect figures and part labels in U.S. patent drawing pages. The challenge drew 232 teams of two, of which 70 teams (30%) submitted solutions. Collectively, teams submitted 1,797 solutions that were compiled on the competition servers. Participants reported spending an average of 63 hours developing their solutions, resulting in a total of 5,591 hours of development time. A manually labeled dataset of 306 patents was used for training, online system tests, and evaluation. The design and performance of the top-5 systems are presented, along with a system developed after the competition which illustrates that winning teams produced near state-of-the-art results under strict time and computation constraints. For the 1st place system, the harmonic mean of recall and precision (f-measure) was 88.57% for figure region detection, 78.81% for figure regions with correctly recognized figure titles, and 70.98% for part label detection and character recognition. Data and software from the competition are available through the online UCI Machine Learning repository to inspire follow-on work by the image processing community.

Via

Access Paper or Ask Questions