Jens Dörpinghaus

Noise-Aware Named Entity Recognition for Historical VET Documents

Jan 01, 2026

Alexander M. Esser, Jens Dörpinghaus

Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach that combines Noise-Aware Training (NAT) on synthetically injected OCR errors with transfer learning and multi-stage fine-tuning. Three complementary strategies (training on noisy, clean, and artificial data) are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but is transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.

* This is an extended, non-peer-reviewed version of the paper presented at VISAPP 2026
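
The idea of synthetically injecting OCR errors into training data can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion table, the function name `inject_ocr_noise`, the `error_rate` parameter, and the label names are all assumptions made for illustration. The sketch only substitutes characters and never splits or merges tokens, so token-level NER labels stay aligned with the noisy text.

```python
import random

# Hypothetical confusion table of OCR look-alike substitutions.
# These pairs are illustrative assumptions, not taken from the paper.
OCR_CONFUSIONS = {
    "l": ["1", "I"],
    "o": ["0"],
    "e": ["c"],
    "m": ["rn"],
    "ü": ["u"],
    "ä": ["a"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a noisy copy of `tokens` with the same number of tokens."""
    rng = random.Random(seed)
    noisy_tokens = []
    for token in tokens:
        chars = []
        for ch in token:
            options = OCR_CONFUSIONS.get(ch)
            # Corrupt a character only if a confusion is known for it
            # and the random draw falls below the error rate.
            if options and rng.random() < error_rate:
                chars.append(rng.choice(options))
            else:
                chars.append(ch)
        noisy_tokens.append("".join(chars))
    return noisy_tokens


# Hypothetical example sentence with made-up label names.
tokens = ["Ausbildung", "zum", "Industriekaufmann", "in", "Köln"]
labels = ["O", "O", "B-OCC", "O", "B-LOC"]
noisy = inject_ocr_noise(tokens, error_rate=0.3, seed=0)

# Each noisy token keeps the label of the clean token at the same index.
training_pairs = list(zip(noisy, labels))
```

Pairs produced this way could be mixed with clean examples during fine-tuning; the mixing ratio and the error rate would need to be tuned against real OCR output.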

Rule-based detection of access to education and training in Germany

Apr 13, 2023

Jens Dörpinghaus, David Samray, Robert Helmrich

Abstract: As a result of transformation processes, the German labor market is highly dependent on vocational training, retraining and continuing education. To match training seekers and offers, we present a novel approach towards the automated detection of access requirements for education and training in German training offers and advertisements. In particular, we focus on (a) general school and education degrees and school-leaving certificates, (b) professional experience, (c) a previous apprenticeship and (d) a list of skills provided by the German Federal Employment Agency. This novel approach combines several methods: First, we provide a mapping of synonyms in education that combines different qualifications and adds deprecated terms. Second, we provide rule-based matching to identify the need for professional experience or an apprenticeship. However, not all access requirements can be matched, due to incompatible data schemata or non-standardized requirements, e.g., initial tests or interviews. While we can identify several shortcomings, the presented approach offers promising results for two data sets: training and retraining advertisements.
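
The combination of a synonym mapping and rule-based matching described in this abstract might look roughly like the following sketch. The synonym entries, the regular expressions, and the function name `detect_access_requirements` are hypothetical and chosen only for illustration; the paper's actual mappings and rules are not reproduced here.

```python
import re

# Hypothetical synonym map: surface form (lowercase) -> canonical qualification.
# Entries are illustrative assumptions, not the paper's mapping.
DEGREE_SYNONYMS = {
    "hauptschulabschluss": "lower secondary certificate",
    "realschulabschluss": "intermediate secondary certificate",
    "mittlere reife": "intermediate secondary certificate",
    "abitur": "higher education entrance qualification",
    "allgemeine hochschulreife": "higher education entrance qualification",
}

# Hypothetical rules for requirements that are phrased freely in the text.
RULES = {
    "professional_experience": re.compile(r"\bberufserfahrung\b|\bberufspraxis\b"),
    "apprenticeship": re.compile(r"\babgeschlossene\w*\s+(berufs)?ausbildung\b"),
}


def detect_access_requirements(text):
    """Return canonical degrees and boolean rule matches found in `text`."""
    lowered = text.lower()
    # Synonym lookup: several surface forms collapse to one canonical degree.
    degrees = sorted(
        {canonical for term, canonical in DEGREE_SYNONYMS.items() if term in lowered}
    )
    # Rule matching: one boolean flag per rule.
    flags = {name: bool(pattern.search(lowered)) for name, pattern in RULES.items()}
    return {"degrees": degrees, **flags}


offer = "Voraussetzung: Mittlere Reife und eine abgeschlossene Berufsausbildung."
result = detect_access_requirements(offer)
```

Requirements without a standardized phrasing, such as initial tests or interviews, would not be caught by rules of this kind, which matches the limitation the abstract names.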

Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs

Feb 10, 2020

Jens Dörpinghaus, Andreas Stefan

Abstract: Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the life sciences and bioinformatics. Although much research has addressed query optimization, query transformation, and the storage and retrieval of large-scale knowledge graphs, algorithmic optimization is still a major challenge and a vital factor in using graph databases. Few researchers have addressed the problem of optimizing algorithms on large-scale labeled property graphs. Here, we present two optimization approaches and compare them with a naive approach of directly querying the graph database. The aim of our work is to determine limiting factors of graph databases like Neo4j, and we describe a novel solution to tackle these challenges. For this, we suggest a classification schema to distinguish the complexity of problems on a graph database. We evaluate our optimization approaches on a test system containing a knowledge graph derived from biomedical publication data enriched with text mining data. This dense graph has more than 71M nodes and 850M relationships. The results are very encouraging: depending on the problem, we were able to show a speedup by a factor between 44 and 3839.

Data Exploration and Validation on dense knowledge graphs for biomedical research

Dec 08, 2019

Jens Dörpinghaus, Alexander Apke, Vanessa Lage-Rupprecht, Andreas Stefan

Abstract: Here we present a holistic, novel approach for data exploration on dense knowledge graphs, with a proof of concept in biomedical research. Knowledge graphs are increasingly becoming a vital factor in knowledge mining and discovery as they connect data using technologies from the semantic web. In this paper we extend a basic knowledge graph extracted from biomedical literature with context data such as named entities and relations obtained by text mining, and with other linked data sources such as ontologies and databases. We present an overview of this novel network. The aim of this work was to extend this current knowledge with approaches from graph theory. This method will build the foundation for quality control, validation of hypotheses, detection of missing data and time series analysis of biomedical knowledge in general. In this context we tried to apply multiple-valued decision diagrams to these questions. In addition, this knowledge representation of linked data can be used as a FAIR approach to answer semantic questions. This paper sheds new light on dense and very large knowledge graphs and the importance of a graph-theoretic understanding of these networks.
