Paul Groth

AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

Sep 09, 2024

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter

Abstract:Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).

* 12 pages excluding references, 3 figures, and 5 tables

Explaining Graph Neural Networks for Node Similarity on Graphs

Jul 10, 2024

Daniel Daza, Cuong Xuan Chu, Trung-Kien Tran, Daria Stepanova, Michael Cochez, Paul Groth

Abstract:Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively studied with methods ranging from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable similarity search over graphs, investigating how GNN-based methods for computing node similarities can be augmented with explanations. Specifically, we evaluate the performance of two prominent approaches to explanations in GNNs, based on the concepts of mutual information (MI) and gradient-based explanations (GB). We discuss their suitability and empirically validate the properties of their explanations over different popular graph benchmarks. We find that unlike MI explanations, gradient-based explanations have three desirable properties. First, they are actionable: selecting inputs depending on them results in predictable changes in similarity scores. Second, they are consistent: the effect of selecting certain inputs overlaps very little with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores.

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Apr 30, 2024

Stefan Grafberger, Paul Groth, Sebastian Schelter

Abstract:Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.

SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

Apr 04, 2024

Bradley P. Allen, Fina Polat, Paul Groth

Abstract:We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definitions of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on GitHub.

* 6 pages, 6 figures, 4 tables, camera-ready copy, accepted to the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code and data see https://github.com/bradleypallen/shroom

AE SemRL: Learning Semantic Association Rules with Autoencoders

Mar 26, 2024

Erkan Karabulut, Victoria Degeler, Paul Groth

Abstract:Association Rule Mining (ARM) is the task of learning associations among data features in the form of logical rules. Mining association rules from high-dimensional numerical data, for example, time series data from a large number of sensors in a smart environment, is a computationally intensive task. In this study, we propose an Autoencoder-based approach to learn and extract association rules from time series data (AE SemRL). Moreover, we argue that in the presence of semantic information related to time series data sources, semantics can facilitate learning generalizable and explainable association rules. Despite enriching time series data with additional semantic features, AE SemRL makes learning association rules from high-dimensional data feasible. Our experiments show that semantic association rules can be extracted from a latent representation created by an Autoencoder and that this method executes on the order of hundreds of times faster than state-of-the-art ARM approaches in many scenarios. We believe that this study advances a new way of extracting associations from representations and has the potential to inspire more research in this field.

AdaTyper: Adaptive Semantic Column Type Detection

Nov 23, 2023

Madelon Hulsebos, Paul Groth, Çağatay Demiralp

Abstract:Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this performance and its applicability in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak supervision to adapt a hybrid type predictor towards new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the F1 score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after only seeing 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.

* Submitted to VLDB'24

Semantic Association Rule Learning from Time Series Data and Knowledge Graphs

Oct 11, 2023

Erkan Karabulut, Victoria Degeler, Paul Groth

Abstract:Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features, including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) have recently been used in DTs, especially for information modelling. Building on this move, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time series data. In addition to this initial pipeline, we also propose a new semantic association rule criterion. The approach is evaluated on an industrial water network scenario. Initial evaluation shows that the proposed approach is able to learn a high number of association rules with semantic information which are more generalizable. The paper aims to set a foundation for further work on using semantic association rule learning, especially in the context of industrial applications.

* This paper has been accepted to SemIIM23: 2nd International Workshop on Semantic Industrial Information Modelling, 7th November 2023, Athens, Greece, co-located with the 22nd International Semantic Web Conference (ISWC 2023)

Observatory: Characterizing Embeddings of Relational Tables

Oct 05, 2023

Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish

Abstract:Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze seven such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.

* Under revision

Knowledge Engineering using Large Language Models

Oct 01, 2023

Bradley P. Allen, Lise Stork, Paul Groth

Abstract:Knowledge engineering is a discipline that focuses on the creation and maintenance of processes that generate and apply knowledge. Traditionally, knowledge engineering approaches have focused on knowledge expressed in formal languages. The emergence of large language models and their capabilities to effectively work with natural language, in its broadest sense, raises questions about the foundations and practice of knowledge engineering. Here, we outline the potential role of LLMs in knowledge engineering, identifying two central directions: 1) creating hybrid neuro-symbolic knowledge systems; and 2) enabling knowledge engineering in natural language. Additionally, we formulate key open research questions to tackle these directions.

* 19 pages, 2 figures, accepted in Transactions on Graph Data and Knowledge

Ontologies in Digital Twins: A Systematic Literature Review

Aug 29, 2023

Erkan Karabulut, Salvatore F. Pileggi, Paul Groth, Victoria Degeler

Abstract:Digital Twins (DT) facilitate monitoring and reasoning processes in cyber-physical systems. They have progressively gained popularity over the past years because of intense research activity and industrial advancements. Cognitive Twins is a novel concept, recently coined to refer to the involvement of Semantic Web technology in DTs. Recent studies address the relevance of ontologies and knowledge graphs in the context of DTs, in terms of knowledge representation, interoperability and automatic reasoning. However, there is no comprehensive analysis of how semantic technologies, and specifically ontologies, are utilized within DTs. This Systematic Literature Review (SLR) is based on the analysis of 82 research articles that either propose or benefit from ontologies with respect to DTs. The paper uses different analysis perspectives, including a structural analysis based on a reference DT architecture, and an application-specific analysis that addresses different domains, such as Manufacturing and Infrastructure. The review also identifies open issues and possible research directions on the usage of ontologies and knowledge graphs in DTs.

* This Systematic Literature Review (SLR) has been submitted to the Future Generation Computer Systems journal's Special Issue on Digital Twin for Future Networks and Emerging IoT Applications (2023)
