Robert E. Blackwell

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

May 18, 2026

Anthony G. Cohn, Robert E. Blackwell

Abstract: We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine-intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation, including prefix/infix notation, words/symbols/nonce terms, and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing, but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward and RCC-22 the most difficult. We release the benchmark and our results under an open licence to facilitate further assessment of qualitative spatio-temporal reasoning in LLMs.

* 74 pages, 20 figures
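
To make the three question types in the abstract concrete, here is a minimal sketch for the simplest calculus listed, Point Algebra. It is not taken from the benchmark itself: the names `BASE`, `CONVERSE`, `NEIGHBOURS`, `compose`, and `brute_force_compose` are hypothetical, and the tables are the standard textbook ones for three base relations on a linear order.

```python
from itertools import product

# Point Algebra base relations between two points on a line.
BASE = ("<", "=", ">")
ALL = frozenset(BASE)

# Converse: if x r y holds, then y CONVERSE[r] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

# Conceptual neighbourhood: relations reachable by one continuous change.
# "<" cannot become ">" without passing through "=".
NEIGHBOURS = {"<": {"="}, "=": {"<", ">"}, ">": {"="}}


def compose(r1, r2):
    """Relations possible between x and z, given x r1 y and y r2 z."""
    if r1 == "=":
        return frozenset({r2})
    if r2 == "=":
        return frozenset({r1})
    if r1 == r2:
        return frozenset({r1})
    # "<" followed by ">" (or the reverse) leaves x and z unconstrained.
    return ALL


def relation(a, b):
    """Base relation holding between two concrete numbers."""
    return "<" if a < b else ">" if a > b else "="


def brute_force_compose(r1, r2, domain=range(3)):
    """Derive a composition table entry by enumerating small integer triples."""
    return frozenset(
        relation(x, z)
        for x, y, z in product(domain, repeat=3)
        if relation(x, y) == r1 and relation(y, z) == r2
    )


# The full composition table as a dictionary keyed by relation pairs.
COMPOSITION_TABLE = {(a, b): compose(a, b) for a, b in product(BASE, repeat=2)}
```

A composition question then amounts to a table lookup, for example `COMPOSITION_TABLE[("<", ">")]`, and the brute-force function gives an independent check of each entry. Larger calculi such as RCC-8 follow the same pattern with bigger tables.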


Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Oct 04, 2024

Robert E. Blackwell, Jon Barry, Anthony G. Cohn

Abstract: Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when the temperature is set to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the time and cost of repeated experiments. We use benchmarks designed for testing LLMs' capacity to reason about cardinal directions to explore the impact of experimental repeats on mean score and prediction interval. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score and make recommendations concerning reproducible LLM evaluation.

* 4 pages, 1 figure
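
The abstract does not spell out the authors' method, so the following is only a generic sketch of how a mean score and a prediction interval for one further run might be computed from repeated benchmark runs. The function name `prediction_interval` and the example scores are hypothetical. By default it uses a normal critical value from the standard library, which is an assumption that understates the interval width when there are only a few repeats; a Student's t quantile can be passed in through `critical` instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def prediction_interval(scores, level=0.95, critical=None):
    """Return (mean, lower, upper) for the score of one future repeat.

    scores   -- benchmark scores from independent repeats of the same run
    level    -- two-sided coverage, e.g. 0.95
    critical -- optional critical value (e.g. a Student's t quantile);
                defaults to a normal approximation (assumption)
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeats to estimate spread")
    if critical is None:
        critical = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    # sqrt(1 + 1/n) accounts for both run-to-run noise and
    # the uncertainty in the estimated mean.
    half_width = critical * s * sqrt(1 + 1 / n)
    return m, m - half_width, m + half_width


# Hypothetical accuracy scores from five repeats of one benchmark.
example_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
example_mean, example_low, example_high = prediction_interval(example_scores)
```

Reporting the interval alongside the mean shows how much a single additional run could plausibly differ from the reported score.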
