Dan Shapiro

Prompting Science Report 3: I'll pay you or I'll kill you -- but will you care?

Aug 01, 2025

Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Abstract: This is the third in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate two commonly held prompting beliefs: a) offering to tip the AI model and b) threatening the AI model. Tipping was a commonly shared tactic for improving AI performance and threats have been endorsed by Google Founder Sergey Brin (All-In, May 2025, 8:20) who observed that 'models tend to do better if you threaten them,' a claim we subject to empirical testing here. We evaluate model performance on GPQA (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024). We demonstrate two things:

- Threatening or tipping a model generally has no significant effect on benchmark performance.
- Prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question.

Taken together, this suggests that simple prompting variations might not be as effective as previously assumed, especially for difficult problems. However, as reported previously (Meincke et al. 2025a), prompting approaches can yield significantly different results for individual questions.

Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

Jun 08, 2025

Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Abstract: This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving performance on reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things:

- The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not inherently engage in step-by-step processing by default. However, CoT can introduce more variability in answers, sometimes triggering occasional errors in questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers.
- For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.

Prompting Science Report 1: Prompt Engineering is Complicated and Contingent

Mar 04, 2025

Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Abstract: This is the first of a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we demonstrate two things:

- There is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark, and choosing a standard has a big impact on how well the LLM does on that benchmark. The standard you choose will depend on your goals for using an LLM in a particular case.
- It is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Specifically, we find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance. We also find that constraining the AI's answers helps performance in some cases, though it may lower performance in other cases.

Taken together, this suggests that benchmarking AI performance is not one-size-fits-all, and also that particular prompting formulas or approaches, like being polite to the AI, are not universally valuable.
