Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Felix Steffek

Shortcut Learning in Legal Judgment Prediction: Empirical Evidence from the UK Employment Tribunal

Jul 05, 2026

Joe Watson, Joana Ribeiro de Faria, Marcus Tomalin, Måns Magnusson, Huiyuan Xie, Hao Tian Yeung, Felix Steffek

Abstract:Current Legal Judgment Prediction (LJP) is constrained by its reliance on post-hoc judicial materials, increasing the likelihood that models perform retrospective classification rather than true forecasting. This paper empirically investigates shortcut learning in this context by studying claim-level outcome prediction in UK Employment Tribunal (UKET) decisions. Using a corpus of 33,158 individual claims, we predict outcomes from claim texts and LLM-extracted case summaries, evaluating models ranging from interpretable TF-IDF-based classifiers to black-box LLMs. While headline predictive performance figures appear strong, we demonstrate that such performance in LJP systems trained on post-hoc judicial text can be driven by the retrospective nature of the source material. Stratifying the test data by human judgments of leakage reveals that performance increases where outcome-revealing cues are embedded in the narrative. Moreover, a model trained on just the 4% of features identified as leakage achieves high performance, outperforming human experts. These findings substantiate concerns that LJP performance may be exaggerated by linguistic artefacts. Yet this vulnerability is not fatal to the research agenda. Instead, post-hoc judgments might be treated as potentially contaminated texts, requiring active auditing. Retraining models after masking leakage features results in only a negligible reduction in Macro-F1. Hence, while models will opportunistically exploit shortcuts when available, they remain capable of extracting useful predictive signals when these artefacts are removed.

Via

Access Paper or Ask Questions

Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning

Jun 25, 2026

Stanisław Sójka, Felix Steffek, Matthias Grabmair

Abstract:Legal outcome prediction must disentangle objective case facts from adjudicative context. Merit-based rulings rely on factual evidence while technical disposals may hinge on judicial discretion. We propose a Judge-Aware Gated Multi-Task Learning architecture that explicitly models this distinction. We introduce a fine-grained outcome taxonomy to supervise the encoder, enforcing a structural regularization that disentangles distinct semantic pathways. This granular legal curriculum enables our Gated Fusion mechanism to dynamically modulate reliance on judge identity. We evaluate our approach on 13,937 UK Employment Tribunal decisions. We benchmark our design against supervised fine-tuning (SFT) of a Gemma-4 26B-A4B backbone, in which judge identity and the taxonomy are injected as prompt tokens or autoregressive output targets. The two contextual signals compose only weakly when forced through a single autoregressive channel. In contrast, coupling a LoRA-adapted Gemma-4 encoder with our gated architecture defines a new state of the art on this benchmark while requiring an order of magnitude fewer trainable parameters than the generative SFT baselines, with gains concentrated on the most ambiguous and rarest outcome classes. Beyond accuracy, the architecture is interpretable; learned judge embeddings and calibration profiles localize the cases where adjudicative context drives the prediction. These results indicate that, for identity-conditioned classification of legal outcomes, the choice of conditioning interface dominates scale: differentiable structured composition yields more accurate, more parameter-efficient models than prompt-based composition over a substantially larger backbone.

* 17 pages (8 pages main text), 5 figures, 9 tables. Accepted to the AI for Law Workshop at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea

Via

Access Paper or Ask Questions

AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

Oct 01, 2025

Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek(+2 more)

Abstract:As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.

Via

Access Paper or Ask Questions

The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Sep 12, 2024

Huiyuan Xie, Felix Steffek, Joana Ribeiro de Faria, Christine Carter, Jonathan Rutherford

Figure 1 for The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Figure 2 for The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Figure 3 for The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Figure 4 for The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal

Abstract:This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.

Via

Access Paper or Ask Questions

Topic Modelling Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment

May 21, 2024

Holli Sargeant, Ahmed Izzidien, Felix Steffek

Abstract:This paper addresses a critical gap in legal analytics by developing and applying a novel taxonomy for topic modelling summary judgment cases in the United Kingdom. Using a curated dataset of summary judgment cases, we use the Large Language Model Claude 3 Opus to explore functional topics and trends. We find that Claude 3 Opus correctly classified the topic with an accuracy of 87.10%. The analysis reveals distinct patterns in the application of summary judgments across various legal domains. As case law in the United Kingdom is not originally labelled with keywords or a topic filtering option, the findings not only refine our understanding of the thematic underpinnings of summary judgments but also illustrate the potential of combining traditional and AI-driven approaches in legal classification. Therefore, this paper provides a new and general taxonomy for UK law. The implications of this work serve as a foundation for further research and policy discussions in the field of judicial administration and computational legal research methodologies.

Via

Access Paper or Ask Questions

Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models

Mar 19, 2024

Joana Ribeiro de Faria, Huiyuan Xie, Felix Steffek

Abstract:Court transcripts and judgments are rich repositories of legal knowledge, detailing the intricacies of cases and the rationale behind judicial decisions. The extraction of key information from these documents provides a concise overview of a case, crucial for both legal experts and the public. With the advent of large language models (LLMs), automatic information extraction has become increasingly feasible and efficient. This paper presents a comprehensive study on the application of GPT-4, a large language model, for automatic information extraction from UK Employment Tribunal (UKET) cases. We meticulously evaluated GPT-4's performance in extracting critical information with a manual verification process to ensure the accuracy and relevance of the extracted data. Our research is structured around two primary extraction tasks: the first involves a general extraction of eight key aspects that hold significance for both legal specialists and the general public, including the facts of the case, the claims made, references to legal statutes, references to precedents, general case outcomes and corresponding labels, detailed order and remedies and reasons for the decision. The second task is more focused, aimed at analysing three of those extracted features, namely facts, claims and outcomes, in order to facilitate the development of a tool capable of predicting the outcome of employment law disputes. Through our analysis, we demonstrate that LLMs like GPT-4 can obtain high accuracy in legal information extraction, highlighting the potential of LLMs in revolutionising the way legal information is processed and utilised, offering significant implications for legal research and practice.

Via

Access Paper or Ask Questions

LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

Mar 04, 2024

Ahmed Izzidien, Holli Sargeant, Felix Steffek

Figure 1 for LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

Figure 2 for LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

Figure 3 for LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

Figure 4 for LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

Abstract:To undertake computational research of the law, efficiently identifying datasets of court decisions that relate to a specific legal issue is a crucial yet challenging endeavour. This study addresses the gap in the literature working with large legal corpora about how to isolate cases, in our case summary judgments, from a large corpus of UK court decisions. We introduce a comparative analysis of two computational methods: (1) a traditional natural language processing-based approach leveraging expert-generated keywords and logical operators and (2) an innovative application of the Claude 2 large language model to classify cases based on content-specific prompts. We use the Cambridge Law Corpus of 356,011 UK court decisions and determine that the large language model achieves a weighted F1 score of 0.94 versus 0.78 for keywords. Despite iterative refinement, the search logic based on keywords fails to capture nuances in legal language. We identify and extract 3,102 summary judgment cases, enabling us to map their distribution across various UK courts over a temporal span. The paper marks a pioneering step in employing advanced natural language processing to tackle core legal research tasks, demonstrating how these technologies can bridge systemic gaps and enhance the accessibility of legal information. We share the extracted dataset metrics to support further research on summary judgments.

* 36 pages, 13 figures. This work was funded by the Nuffield Foundation grant Access to Justice Through Artificial Intelligence. The views expressed are those of the authors and not necessarily of the Foundation. Visit www.nuffieldfoundation.org. We are grateful to Nicola Mathew for excellent research assistance

Via

Access Paper or Ask Questions

The Cambridge Law Corpus: A Corpus for Legal AI Research

Sep 22, 2023

Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek

Figure 1 for The Cambridge Law Corpus: A Corpus for Legal AI Research

Figure 2 for The Cambridge Law Corpus: A Corpus for Legal AI Research

Figure 3 for The Cambridge Law Corpus: A Corpus for Legal AI Research

Figure 4 for The Cambridge Law Corpus: A Corpus for Legal AI Research

Abstract:We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

* Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2023

Via

Access Paper or Ask Questions