Abstract:This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
Abstract:As a result of transformation processes, the German labor market is highly dependent on vocational training, retraining and continuing education. To match training seekers and offers, we present a novel approach towards the automated detection of access to education and training in German training offers and advertisements. We will in particular focus on (a) general school and education degrees and schoolleaving certificates, (b) professional experience, (c) a previous apprenticeship and (d) a list of skills provided by the German Federal Employment Agency. This novel approach combines several methods: First, we provide a mapping of synonyms in education combining different qualifications and adding deprecated terms. Second, we provide a rule-based matching to identify the need for professional experience or apprenticeship. However, not all access requirements can be matched due to incompatible data schemata or non-standardizes requirements, e.g initial tests or interviews. While we can identify several shortcomings, the presented approach offers promising results for two data sets: training and re-training advertisements.




Abstract:Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the field of life sciences or bioinformatics. Although a lot of research has been done on the field of query optimization, query transformation and of course in storing and retrieving large scale knowledge graphs the field of algorithmic optimization is still a major challenge and a vital factor in using graph databases. Few researchers have addressed the problem of optimizing algorithms on large scale labeled property graphs. Here, we present two optimization approaches and compare them with a naive approach of directly querying the graph database. The aim of our work is to determine limiting factors of graph databases like Neo4j and we describe a novel solution to tackle these challenges. For this, we suggest a classification schema to differ between the complexity of a problem on a graph database. We evaluate our optimization approaches on a test system containing a knowledge graph derived biomedical publication data enriched with text mining data. This dense graph has more than 71M nodes and 850M relationships. The results are very encouraging and - depending on the problem - we were able to show a speedup of a factor between 44 and 3839.




Abstract:Here we present a holistic approach for data exploration on dense knowledge graphs as a novel approach with a proof-of-concept in biomedical research. Knowledge graphs are increasingly becoming a vital factor in knowledge mining and discovery as they connect data using technologies from the semantic web. In this paper we extend a basic knowledge graph extracted from biomedical literature by context data like named entities and relations obtained by text mining and other linked data sources like ontologies and databases. We will present an overview about this novel network. The aim of this work was to extend this current knowledge with approaches from graph theory. This method will build the foundation for quality control, validation of hypothesis, detection of missing data and time series analysis of biomedical knowledge in general. In this context we tried to apply multiple-valued decision diagrams to these questions. In addition this knowledge representation of linked data can be used as FAIR approach to answer semantic questions. This paper sheds new lights on dense and very large knowledge graphs and the importance of a graph-theoretic understanding of these networks.