Vishakh Padmakumar

Debate Helps Supervise Unreliable Experts

Nov 15, 2023
Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman

As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can't tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth. We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by 'expert' debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer, which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%. Debates are also more efficient, being 68% of the length of consultancies. By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill), whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
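
The two supervision protocols compared above can be summarized as a simple evaluation loop. The sketch below is an illustrative reconstruction, not the authors' code (their implementation is in the linked repository): `debater` and `judge` are hypothetical callables standing in for human participants or AI agents, and details such as turn structure and judge interaction are simplified.

```python
import random

def run_debate(question, answers, correct_idx, debater, judge, rounds=3):
    """Two debaters argue for opposing answers; the judge never sees the passage."""
    transcript = []
    for _ in range(rounds):
        for answer in answers:
            # Each debater argues for its assigned answer, selectively quoting the passage.
            transcript.append(debater(question, answer, transcript))
    return judge(question, answers, transcript) == correct_idx

def run_consultancy(question, answers, correct_idx, debater, judge, rounds=3):
    """One consultant argues for a single answer that is correct only half the time."""
    assigned = correct_idx if random.random() < 0.5 else 1 - correct_idx  # two-answer setting
    transcript = []
    for _ in range(rounds):
        transcript.append(debater(question, answers[assigned], transcript))
    return judge(question, answers, transcript) == correct_idx

def judge_accuracy(protocol, dataset, debater, judge):
    """Fraction of questions on which the judge selects the correct answer."""
    hits = [protocol(q, a, c, debater, judge) for q, a, c in dataset]
    return sum(hits) / len(hits)
```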

* 84 pages, 13 footnotes, 5 figures, 4 tables, 28 debate transcripts; data and code at https://github.com/julianmichael/debate/tree/2023-nyu-experiments 

Creativity Support in the Age of Large Language Models: An Empirical Study Involving Emerging Writers

Sep 25, 2023
Tuhin Chakrabarty, Vishakh Padmakumar, Faeze Brahman, Smaranda Muresan

The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions has sparked increased interest in their use across various support tools. We investigate the utility of modern LLMs in assisting professional writers via an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing, which views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to submit a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Upon analyzing the writer-LLM interactions, we find that while writers seek the LLM's help across all three types of cognitive activities, they find LLMs more helpful in translating and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.

Does Writing with Language Models Reduce Content Diversity?

Sep 11, 2023
Vishakh Padmakumar, He He

Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups -- using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content.
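
As a rough illustration of what corpus-level diversity measurement can look like, the sketch below computes distinct-n lexical diversity and mean pairwise word-overlap similarity between essays. These are simplified stand-ins chosen for brevity; the paper's actual metric definitions differ.

```python
from itertools import combinations

def distinct_n(essays, n=2):
    """Fraction of unique n-grams across a set of essays (higher = more lexically diverse)."""
    ngrams = []
    for text in essays:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

def mean_pairwise_similarity(essays):
    """Average word-set Jaccard similarity between essay pairs (higher = more homogeneous)."""
    sims = []
    for a, b in combinations(essays, 2):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        sims.append(len(sa & sb) / max(len(sa | sb), 1))
    return sum(sims) / max(len(sims), 1)

# Compare the three writing conditions on the same prompts, e.g.:
# for setup, essays in {"solo": solo, "gpt3": gpt3, "instructgpt": instruct}.items():
#     print(setup, distinct_n(essays), mean_pairwise_similarity(essays))
```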

* Preprint 

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

May 24, 2023
Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, He He

Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs that use only modus ponens, are of a specific size, and come from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test them on a broad set of deduction rules and measure their ability to generalize from simpler demonstrations to more complex proofs along multiple axes: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and to compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
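
A toy version of controlling proof complexity programmatically is sketched below: it builds a modus ponens chain whose depth is a parameter, using fictional concept names so the answer cannot be guessed from world knowledge. The paper's generator is considerably richer, covering additional deduction rules (e.g., proof by cases and proof by contradiction) as well as width and compositional variation; this sketch is only an illustration.

```python
import random

FICTIONAL_NOUNS = ["wumpus", "yumpus", "zumpus", "dumpus", "rompus", "numpus", "tumpus"]

def modus_ponens_chain(depth):
    """Build a context that requires `depth` applications of modus ponens to answer."""
    concepts = random.sample(FICTIONAL_NOUNS, depth + 1)
    rules = [f"Every {a} is a {b}." for a, b in zip(concepts, concepts[1:])]
    random.shuffle(rules)  # rule order should not reveal the proof structure
    fact = f"Alex is a {concepts[0]}."
    question = f"True or false: Alex is a {concepts[-1]}."
    return " ".join(rules + [fact]), question, "True"

context, question, answer = modus_ponens_chain(depth=4)
print(context)
print(question, answer)
```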

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

May 23, 2023
Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning and propose two types of self-consistency that are particularly important for multi-step logic -- hypothetical consistency (the ability of a model to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's outputs for a compositional task even when an intermediate step is replaced with the model's output for that step). We demonstrate that four sizes of the GPT-3 model exhibit poor consistency rates across both types of consistency on four different tasks (Wikipedia, DailyDialog, arithmetic, and GeoQuery).
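
Compositional consistency, as described above, can be checked with a few model calls. In the sketch below, `ask` is a hypothetical wrapper around whatever model is being probed (no particular API is assumed): the model first answers the composed question directly, and that answer is compared against the one it gives when its own intermediate answer is substituted back into the outer question.

```python
def compositional_consistency(ask, outer_template, inner_expression):
    """Return True if the direct answer matches the answer obtained after
    substituting the model's own intermediate result into the outer question."""
    # Direct answer to the fully composed question, e.g. "What is 7 * (12 + 5)?"
    direct = ask(outer_template.format(inner_expression))
    # Solve the inner step alone, then plug the model's answer back in.
    intermediate = ask(f"What is {inner_expression}?")
    composed = ask(outer_template.format(intermediate))
    return direct.strip() == composed.strip()

# A hypothetical arithmetic instance:
# compositional_consistency(ask, outer_template="What is 7 * ({})?", inner_expression="12 + 5")
```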

Extrapolative Controlled Sequence Generation via Iterative Refinement

Mar 08, 2023
Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P. Parikh

We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are better (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their attribute values are out of the training distribution, posing challenges to existing methods that aim to directly generate the target sequence. Instead, in this work, we propose Iterative Controlled Extrapolation (ICE), which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate a small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.
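
The iterative-refinement idea can be sketched as a simple hill-climbing loop: propose local edits, keep the candidate the attribute scorer likes best, and stop when no edit improves the score. `propose_edits` and `score` below are hypothetical stand-ins for the trained editor and attribute predictor; the authors' actual implementation is in the linked repository.

```python
def iterative_refinement(sequence, propose_edits, score, max_rounds=10):
    """Greedily apply local edits while they improve the predicted attribute value."""
    current, current_score = sequence, score(sequence)
    for _ in range(max_rounds):
        candidates = propose_edits(current)   # small, local edits to `current`
        if not candidates:
            break
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            break                             # no candidate extrapolates any further
        current, current_score = best, best_score
    return current
```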

* Preprint 

Reward Gaming in Conditional Text Generation

Nov 16, 2022
Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He

To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this short discussion piece, we would like to highlight reward gaming in the NLG community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
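
As a toy illustration of the failure mode (constructed for this summary, not taken from the paper), the sketch below trains a bag-of-words "reward model" on preference labels where a meaningless stylistic token happens to co-occur with the good references. A degenerate output that just repeats that token then receives a high learned reward, which is exactly the kind of pattern RL training can amplify.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
good = ["the summary is faithful and covers the main points",
        "a concise and accurate summary of the article"]
bad = ["unrelated rambling text about something else entirely",
       "the output just repeats the prompt verbatim"]

texts, labels = [], []
for _ in range(1000):
    if rng.random() < 0.5:
        text, label = good[rng.integers(len(good))], 1
        if rng.random() < 0.7:
            text += " moreover the tone is neutral"  # stylistic tic of the good references
    else:
        text, label = bad[rng.integers(len(bad))], 0
    texts.append(text)
    labels.append(label)

vectorizer = CountVectorizer().fit(texts)
reward_model = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

# Gibberish that stuffs the spuriously correlated token still gets a high "reward".
degenerate = "moreover moreover moreover moreover moreover"
print(reward_model.predict_proba(vectorizer.transform([degenerate]))[0, 1])
```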

Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing

Oct 25, 2022
Tuhin Chakrabarty, Vishakh Padmakumar, He He

Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success of LLMs in the realm of computer-assisted creativity, we aim to study whether LLMs can improve the quality of user-generated content through collaboration. We present CoPoet, a collaborative poetry writing system. In contrast to auto-completing a user's text, CoPoet is controlled by user instructions that specify the attributes of the desired text, such as "Write a sentence about 'love'" or "Write a sentence ending in 'fly'". The core component of our system is a language model fine-tuned on a diverse collection of instructions for poetry writing. Our model is not only competitive with publicly available LLMs trained on instructions (InstructGPT), but is also capable of satisfying unseen compositional instructions. A study with 15 qualified crowdworkers shows that users successfully write poems with CoPoet on diverse topics ranging from monarchy to climate change. Further, the collaboratively written poems are preferred by third-party evaluators over those written without the system.
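
In practice, interacting with an instruction-tuned model in this style looks roughly like the sketch below. The checkpoint name is a placeholder rather than the authors' released model, and the decoding settings are arbitrary; the snippet only illustrates the instruction-as-control-interface idea.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "your-org/instruction-tuned-poetry-model"  # hypothetical checkpoint, not CoPoet's release
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

instruction = "Write a sentence about 'love' that ends in the word 'fly'."
inputs = tokenizer(instruction, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```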

* To appear at EMNLP 2022 

Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions

Oct 19, 2022
Alicia Parrish, Harsh Trivedi, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Amanpreet Singh Saimbhi, Samuel R. Bowman

The use of language-model-based question-answering systems to aid humans in completing difficult tasks is limited, in part, by the unreliability of the text these systems generate. Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown that just a single turn of arguments in this format is not helpful to humans. However, as debate settings are characterized by a back-and-forth dialogue, we follow up on previous results to test whether adding a second round of counter-arguments is helpful to humans. We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task. These findings suggest that, in the case of answering reading comprehension questions, debate is not a helpful format.

* 12 pages, 6 figures, 7 tables 