Noy Sternlicht

Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Jun 05, 2025

Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim

Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

* Code: https://github.com/noy-sternlicht/Debatable-Intelligence

CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

May 28, 2025

Noy Sternlicht, Tom Hope

Abstract: A hallmark of human innovation is the process of recombination -- creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at https://github.com/noy-sternlicht/CHIMERA-KB

* Project page: https://noy-sternlicht.github.io/CHIMERA-Web

In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis

May 20, 2025

Hiba Arnaout, Noy Sternlicht, Tom Hope, Iryna Gurevych

Abstract: Understanding the impact of scientific publications is crucial for identifying breakthroughs and guiding future research. Traditional metrics based on citation counts often miss the nuanced ways a paper contributes to its field. In this work, we propose a new task: generating nuanced, expressive, and time-aware impact summaries that capture both praise (confirmation citations) and critique (correction citations) through the evolution of fine-grained citation intents. We introduce an evaluation framework tailored to this task, showing moderate to strong human correlation on subjective metrics such as insightfulness. Expert feedback from professors reveals a strong interest in these summaries and suggests future improvements.
