Kim Gerdes

LISN, Qatent

Citation-Driven Multi-View Training for Patent Embeddings: QaECTER and Sophia-Bench

Apr 24, 2026

Younes Djemmal, You Zuo, Kim Gerdes, Kirian Guiller

Abstract: Patent retrieval underpins critical decisions in innovation, examination, and IP strategy, yet progress has been hampered by the absence of benchmarks that reflect the diversity of real-world search scenarios. We address this gap with two contributions. First, we introduce Sophia-bench, a large-scale patent retrieval benchmark comprising 10,000 queries and 75,000 corpus documents stratified across ten years, eight IPC technology sections, and twelve filing jurisdictions. Unlike prior benchmarks, Sophia-bench tests retrieval using 12 different query types, from structured patent fields to AI-generated summaries, and evaluates results against citation-based ground truth enhanced with a novel domain-relevance metric (InScope). Together, these enable systematic measurement of how well models perform across query types, technology domains, and jurisdictions. Second, we introduce QaECTER, a 344M-parameter embedding model trained on patent citation graphs and multi-view self-alignment. Despite its compact size, QaECTER establishes a new state of the art for patent retrieval. It outperforms the #1 model on the English retrieval text embedding benchmark (RTEB), a model 23x larger, as well as all existing patent-specific models across every query type, IPC section, and jurisdiction on Sophia-bench, with gains of up to 7.2% average NDCG@10 over the next-best model. These results are confirmed on an independent external benchmark, where QaECTER surpasses all prior models without requiring task-specific instruction prompts. Both the benchmark and the model are designed for practical deployment in large-scale patent search systems.
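
For readers unfamiliar with the NDCG@10 figure quoted in the abstract, the following is a minimal sketch of how that metric is commonly computed. It assumes binary relevance (a retrieved document counts as relevant if it appears in the query's citation-based ground truth); the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the sketch does not model the InScope metric or the benchmark's actual evaluation code.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k gains (rank 1 is index 0)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (an assumption, not the paper's exact setup).

    ranked_ids:   document ids in the order the model returned them
    relevant_ids: set of ids treated as ground truth (e.g. cited patents)
    """
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # Ideal ranking: every relevant document placed at the top of the list.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A benchmark-level score would then typically be the mean of `ndcg_at_k` over all queries, which is presumably what "average NDCG@10" refers to here.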

PatentEval: Understanding Errors in Patent Generation

Jun 05, 2024

You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

Abstract: In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.

* NAACL 2024 - 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Jun 2024, Mexico City, Mexico

PatFig: Generating Short and Long Captions for Patent Figures

Sep 15, 2023

Dana Aubakirova, Kim Gerdes, Lufei Liu

Abstract: This paper introduces Qatent PatFig, a novel large-scale patent figure dataset comprising 30,000+ patent figures from over 11,000 European patent applications. For each figure, this dataset provides short and long captions, reference numerals, their corresponding terms, and the minimal claim set that describes the interactions between the components of the image. To assess the usability of the dataset, we fine-tune an LVLM on Qatent PatFig to generate short and long descriptions, and we investigate the effects of incorporating various text-based cues at the prediction stage of the patent figure captioning process.

* Accepted to ICCV 2023, CLVL: 5th Workshop on Closing the Loop Between Vision and Language
