Abstract:Entity matching (EM), the task of identifying whether two descriptions refer to the same entity, is essential in data management. Traditional methods have evolved from rule-based to AI-driven approaches, yet current techniques using large language models (LLMs) often fall short due to their reliance on static knowledge and rigid, predefined prompts. In this paper, we introduce Libem, a compound AI system designed to address these limitations by incorporating a flexible, tool-oriented approach. Libem supports entity matching through dynamic tool use, self-refinement, and optimization, allowing it to adapt and refine its process based on the dataset and performance metrics. Unlike traditional solo-AI EM systems, which often suffer from a lack of modularity that hinders iterative design improvements and system optimization, Libem offers a composable and reusable toolchain. This approach aims to contribute to ongoing discussions and developments in AI-driven data management.
Abstract:Schema evolution is critical in managing database systems to ensure compatibility across different data versions. A schema registry typically addresses the challenges of schema evolution in real-time data streaming by managing, validating, and ensuring schema compatibility. However, current schema registries struggle with complex syntactic alterations like field renaming or type changes, which often require significant manual intervention and can disrupt service. To enhance the flexibility of schema evolution, we propose the use of generalized schema evolution (GSE) facilitated by a compound AI system. This system employs Large Language Models (LLMs) to interpret the semantics of schema changes, supporting a broader range of syntactic modifications without interrupting data streams. Our approach includes developing a task-specific language, Schema Transformation Language (STL), to generate schema mappings as an intermediate representation (IR), simplifying the integration of schema changes across different data processing platforms. Initial results indicate that this approach can improve schema mapping accuracy and efficiency, demonstrating the potential of GSE in practical applications.