Marieke van Erp

Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Nov 13, 2025

Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp

Abstract:Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

* 46 pages

Explicit vs. Implicit Biographies: Evaluating and Adapting LLM Information Extraction on Wikidata-Derived Texts

Sep 18, 2025

Alessandra Stramiglio, Andrea Schimmenti, Valentina Pasqual, Marieke van Erp, Francesco Sovrano, Fabio Vitali

Abstract:Text Implicitness has always been challenging in Natural Language Processing (NLP), with traditional methods relying on explicit statements to identify entities and their relationships. From the sentence "Zuhdi attends church every Sunday", the relationship between Zuhdi and Christianity is evident for a human reader, but it presents a challenge when it must be inferred automatically. Large language models (LLMs) have proven effective in NLP downstream tasks such as text comprehension and information extraction (IE). This study examines how textual implicitness affects IE tasks in pre-trained LLMs: LLaMA 2.3, DeepSeekV1, and Phi1.5. We generate two synthetic datasets of 10k implicit and explicit verbalizations of biographic information to measure the impact on LLM performance and analyze whether fine-tuning on implicit data improves their ability to generalize in implicit reasoning tasks. This research presents an experiment on the internal reasoning processes of LLMs in IE, particularly in dealing with implicit and explicit contexts. The results demonstrate that fine-tuning LLMs with LoRA (low-rank adaptation) improves their performance in extracting information from implicit texts, contributing to better model interpretability and reliability.

The Wind in Our Sails: Developing a Reusable and Maintainable Dutch Maritime History Knowledge Graph

Nov 10, 2021

Stijn Schouten, Victor de Boer, Lodewijk Petram, Marieke van Erp

Abstract:Digital sources are more prevalent than ever but effectively using them can be challenging. One core challenge is that digitized sources are often distributed, thus forcing researchers to spend time collecting, interpreting, and aligning different sources. A knowledge graph can accelerate research by providing a single connected source of truth that humans and machines can query. During two design-test cycles, we convert four data sets from the historical maritime domain into a knowledge graph. The focus during these cycles is on creating a sustainable and usable approach that can be adopted in other linked data conversion efforts. Furthermore, our knowledge graph is available for maritime historians and other interested users to investigate the daily business of the Dutch East India Company through a unified portal.

* Accepted to K-CAP'21, December 2-3, 2021, Virtual Event, USA

Marriage is a Peach and a Chalice: Modelling Cultural Symbolism on the SemanticWeb

Nov 03, 2021

Bruno Sartini, Marieke van Erp, Aldo Gangemi

Abstract:In this work, we fill the gap in the Semantic Web in the context of Cultural Symbolism. Building upon earlier work, we introduce the Simulation Ontology, an ontology that models the background knowledge of symbolic meanings, developed by combining the concepts taken from the authoritative theory of Simulacra and Simulations of Jean Baudrillard with symbolic structures and content taken from "Symbolism: a Comprehensive Dictionary" by Steven Olderr. We re-engineered the symbolic knowledge already present in heterogeneous resources by converting it into our ontology schema to create HyperReal, the first knowledge graph completely dedicated to cultural symbolism. A first experiment run on the knowledge graph is presented to show the potential of quantitative research on symbolism.

* 8 pages, 5 figures

Knowledge Graphs Evolution and Preservation -- A Technical Report from ISWS 2019

Dec 22, 2020

Nacira Abbas, Kholoud Alghamdi, Mortaza Alinam, Francesca Alloatti, Glenda Amaral, Claudia d'Amato, Luigi Asprino, Martin Beno, Felix Bensmann, Russa Biswas(+64 more)

Abstract:One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. [...] This grand challenge extends this further by asking if we can create a knowledge graph of "everything" ranging from common sense concepts to location based entities. This knowledge graph should be "open to the public" in a FAIR manner democratizing this mass amount of knowledge." Although linked open data (LOD) is one knowledge graph, it is the closest realisation (and probably the only one) to a public FAIR Knowledge Graph (KG) of everything. Surely, LOD provides a unique testbed for experimenting and evaluating research hypotheses on open and FAIR KG. One of the most neglected FAIR issues about KGs is their ongoing evolution and long term preservation. We want to investigate this problem, that is to understand what preserving and supporting the evolution of KGs means and how these problems can be addressed. Clearly, the problem can be approached from different perspectives and may require the development of different approaches, including new theories, ontologies, metrics, strategies, procedures, etc. This document reports a collaborative effort performed by 9 teams of students, each guided by a senior researcher as their mentor, attending the International Semantic Web Research School (ISWS 2019). Each team provides a different perspective to the problem of knowledge graph evolution substantiated by a set of research questions as the main subject of their investigation. In addition, they provide their working definition for KG preservation and evolution.

Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels

Dec 06, 2020

Ryan Brate, Paul Groth, Marieke van Erp

Abstract:Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them 'smell experiences', offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offers significantly better performance than a keyword-based baseline.

* Accepted to The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2020). Barcelona, Spain. December 2020.

Ontologies in CLARIAH: Towards Interoperability in History, Language and Media

Apr 06, 2020

Albert Meroño-Peñuela, Victor de Boer, Marieke van Erp, Willem Melder, Rick Mourits, Auke Rijpma, Ruben Schalk, Richard Zijdeman

Abstract:One of the most important goals of digital humanities is to provide researchers with data and tools for new research questions, either by increasing the scale of scholarly studies, linking existing databases, or improving the accessibility of data. Here, the FAIR principles provide a useful framework as these state that data needs to be: Findable, as they are often scattered among various sources; Accessible, since some might be offline or behind paywalls; Interoperable, thus using standard knowledge representation formats and shared vocabularies; and Reusable, through adequate licensing and permissions. Integrating data from diverse humanities domains is not trivial: the research questions of scholars (such as "was economic wealth equally distributed in the 18th century?" or "what are narratives constructed around disruptive media events?") and their preparation phases (e.g. data collection, knowledge organisation, cleaning) need to be taken into account. In this chapter, we describe the ontologies and tools developed and integrated in the Dutch national project CLARIAH to address these issues across datasets from three fundamental domains or "pillars" of the humanities (linguistics, social and economic history, and media studies) that have paradigmatic data representations (textual corpora, structured data, and multimedia). We summarise the lessons learnt from using such ontologies and tools in these domains from a generalisation and reusability perspective.

* 26 pages

Analysis of Named Entity Recognition and Linking for Tweets

Oct 27, 2014

Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, Kalina Bontcheva

Abstract:Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

* Information Processing & Management 51 (2), 32-49, 2014
* 35 pages, accepted to journal Information Processing and Management
