Philipp Mondorf

LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

Feb 06, 2026

Unravelling the Mechanisms of Manipulating Numbers in Language Models

Oct 30, 2025

Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

May 26, 2025

Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality

Apr 02, 2025

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

Feb 17, 2025

Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination

Oct 24, 2024

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

Oct 02, 2024

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Jun 26, 2024

Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models

Jun 18, 2024

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Apr 02, 2024