Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

He Tan

From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development

Jan 31, 2026

Xuan Liu, Ziyu Li, Mu He, Ziyang Ma, Xiaoxu Wu, Gizem Yilmaz, Yiyuan Xia, Bingbing Li, He Tan, Jerry Ying Hsi Fuh(+3 more)

Abstract:Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. However, traditional ontology construction relies on manual annotation and conventional natural language processing (NLP) techniques, making the process labour-intensive and costly, especially in specialised fields like casting manufacturing. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, including pre-trained LLM-driven method, in-context learning (ICL) method and fine-tuning method to extract terms and relations from domain-specific texts using limited data. We compare their performances and use the best-performing method to build a casting ontology that validated by domian expert.

* 11 pages,8 figures,3 tables,presented at International Conference on Industry of the Future and Smart Manufacturing,2025

Via

Access Paper or Ask Questions

CODE-ACCORD: A Corpus of Building Regulatory Data for Rule Generation towards Automatic Compliance Checking

Mar 04, 2024

Hansi Hettiarachchi, Amna Dridi, Mohamed Medhat Gaber, Pouyan Parsafard, Nicoleta Bocaneala, Katja Breitenfelder, Gonçal Costa, Maria Hedblom, Mihaela Juganaru-Mathieu, Thamer Mecharnia(+4 more)

Abstract:Automatic Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector necessitates automating the interpretation of building regulations to achieve its full potential. However, extracting information from textual rules to convert them to a machine-readable format has been a challenge due to the complexities associated with natural language and the limited resources that can support advanced machine-learning techniques. To address this challenge, we introduce CODE-ACCORD, a unique dataset compiled under the EU Horizon ACCORD project. CODE-ACCORD comprises 862 self-contained sentences extracted from the building regulations of England and Finland. Aligned with our core objective of facilitating information extraction from text for machine-readable rule generation, each sentence was annotated with entities and relations. Entities represent specific components such as "window" and "smoke detectors", while relations denote semantic associations between these entities, collectively capturing the conveyed ideas in natural language. We manually annotated all the sentences using a group of 12 annotators. Each sentence underwent annotations by multiple annotators and subsequently careful data curation to finalise annotations, ensuring their accuracy and reliability, thereby establishing the dataset as a solid ground truth. CODE-ACCORD offers a rich resource for diverse machine learning and natural language processing (NLP) related tasks in ACC, including text classification, entity recognition and relation extraction. To the best of our knowledge, this is the first entity and relation-annotated dataset in compliance checking, which is also publicly available.

* This is a preprint of an article submitted to the Data in Brief Journal, Elsevier

Via

Access Paper or Ask Questions

A study on the relation between linguistics-oriented and domain-specific semantics

Dec 07, 2010

He Tan

Abstract:In this paper we dealt with the comparison and linking between lexical resources with domain knowledge provided by ontologies. It is one of the issues for the combination of the Semantic Web Ontologies and Text Mining. We investigated the relations between the linguistics oriented and domain-specific semantics, by associating the GO biological process concepts to the FrameNet semantic frames. The result shows the gaps between the linguistics-oriented and domain-specific semantics on the classification of events and the grouping of target words. The result provides valuable information for the improvement of domain ontologies supporting for text mining systems. And also, it will result in benefits to language understanding technology.

* in Adrian Paschke, Albert Burger, Andrea Splendiani, M. Scott Marshall, Paolo Romano: Proceedings of the 3rd International Workshop on Semantic Web Applications and Tools for the Life Sciences, Berlin,Germany, December 8-10, 2010

Via

Access Paper or Ask Questions