Roger Waldvogel

Slot Filling for Extracting Reskilling and Upskilling Options from the Web

Jul 11, 2022

Albert Weichselbraun, Roger Waldvogel, Andreas Fraefel, Alexander van Schie, Philipp Kuntschik

Abstract: Disturbances in the job market such as advances in science and technology, crises and increased competition have triggered a surge in reskilling and upskilling programs. Information on suitable continuing education options is distributed across many sites, rendering the search, comparison and selection of useful programs a cumbersome task. This paper, therefore, introduces a knowledge extraction system that integrates reskilling and upskilling options into a single knowledge graph. The system collects educational programs from 488 different providers and uses context extraction for identifying and contextualizing relevant content. Afterwards, entity recognition and entity linking methods draw upon a domain ontology to locate relevant entities such as skills, occupations and topics. Finally, slot filling integrates entities based on their context into the corresponding slots of the continuous education knowledge graph. We also introduce a German gold standard that comprises 169 documents and over 3800 annotations for benchmarking the necessary content extraction, entity linking, entity recognition and slot filling tasks, and provide an overview of the system's performance.

* Natural Language Processing and Information Systems (NLDB 2022). This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in "27th International Conference on Applications of Natural Language to Information Systems (NLDB 2022), Valencia, Spain, June 15-17, 2022, Proceedings", and is available online at https://doi.org/10.1007/978-3-031-08473-7_25
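
The slot filling step described in the abstract can be illustrated with a minimal sketch. It assumes that entity recognition and linking have already produced typed entities tagged with the page context in which they were found. The `Entity` class, the context labels, the slot names and the `SLOT_RULES` table are all hypothetical and are not taken from the paper; the actual system may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    surface: str  # mention text as found on the page
    uri: str      # hypothetical ontology identifier from entity linking
    etype: str    # "skill", "occupation" or "topic"
    context: str  # hypothetical label of the page section containing the mention


# Hypothetical rules: (context, entity type) -> slot of the knowledge graph.
SLOT_RULES = {
    ("prerequisites", "skill"): "required_skill",
    ("learning_outcomes", "skill"): "taught_skill",
    ("target_audience", "occupation"): "target_occupation",
    ("content", "topic"): "covered_topic",
}


def fill_slots(entities):
    """Assign each linked entity to a slot based on its type and context."""
    slots = {}
    for entity in entities:
        slot = SLOT_RULES.get((entity.context, entity.etype))
        if slot is None:
            continue  # no rule for this combination: leave the entity out
        values = slots.setdefault(slot, [])
        if entity.uri not in values:  # avoid duplicate slot values
            values.append(entity.uri)
    return slots
```

In this sketch the same skill can land in different slots depending on where it is mentioned, which is the point of filling slots "based on their context": a skill listed under prerequisites is required, whereas the same skill under learning outcomes is taught.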

Harvest -- An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums

Feb 03, 2021

Albert Weichselbraun, Adrian M. P. Brasoveanu, Roger Waldvogel, Fabian Odoni

Abstract: Automatic extraction of forum posts and metadata is a crucial but challenging task since forums do not expose their content in a standardized structure. Content extraction methods, therefore, often need customizations such as adaptations to page templates and improvements of their extraction code before they can be deployed to new forums. Most of the current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content such as the identification of author metadata and information on the thread structure. This paper, therefore, presents a method that determines the XPath of forum posts, eliminating incorrect mergers and splits of the extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL and structure are extracted. We also introduce Harvest, a new open source toolkit that implements the presented methods and create a gold standard extracted from 52 different Web forums for evaluating our approach. A comprehensive evaluation reveals that Harvest clearly outperforms competing systems.

* IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2020), Accepted 27 October 2020
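
To give an idea of what "determining the XPath of forum posts" can mean in practice, here is a minimal sketch of a simple heuristic. It is not Harvest's algorithm and does not use Harvest's API: it assumes well-formed XHTML input, and the function name `candidate_post_xpath` and the scoring rule (pick the repeated tag/class path that covers the most text) are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def _step(element):
    """Build one path step from the tag and, if present, the class attribute."""
    css_class = element.get("class")
    return f"{element.tag}[@class='{css_class}']" if css_class else element.tag


def candidate_post_xpath(markup):
    """Return the repeated tag/class path covering the most text, or None.

    Illustrative heuristic only: posts are assumed to be sibling-like elements
    that share the same path and together hold most of the page's text.
    """
    root = ET.fromstring(markup)
    stats = defaultdict(lambda: [0, 0])  # path -> [occurrences, text characters]

    def walk(element, prefix):
        path = f"{prefix}/{_step(element)}"
        stats[path][0] += 1
        stats[path][1] += len("".join(element.itertext()).strip())
        for child in element:
            walk(child, path)

    walk(root, "")
    # Only paths that occur at least twice can describe a list of posts.
    repeated = {p: s for p, s in stats.items() if s[0] >= 2}
    if not repeated:
        return None
    return max(repeated, key=lambda p: (repeated[p][1], repeated[p][0]))
```

Selecting the whole post container rather than an inner paragraph is what keeps author names and post bodies together, which relates to the "incorrect mergers and splits" the abstract mentions. Real forum pages are rarely well-formed XML, so a production system would need a tolerant HTML parser and a far more robust scoring scheme.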
