Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various domains, yet they struggle with knowledge-intensive tasks in areas that demand factual accuracy, e.g., industrial automation and healthcare. Key limitations include their tendency to hallucinate, their lack of source traceability (provenance), and the difficulty of keeping their knowledge up to date. Combining language models with knowledge graphs (GraphRAG) offers a promising avenue for overcoming these deficits. However, a major challenge lies in creating such a knowledge graph in the first place. Here, we propose a novel approach that combines LLMs with a tripartite knowledge graph representation. Starting from an initial lexical graph, the graph is constructed by connecting complex, domain-specific objects, via a curated ontology of corresponding domain-specific concepts, to relevant sections within text chunks through a concept-anchored pre-analysis of the source documents. As a consequence, our Tripartite-GraphRAG approach i) implements a concept-specific, information-preserving pre-compression of textual chunks; ii) allows for a concept-specific, statistically grounded relevance estimation of embedding similarities; and iii) avoids common challenges w.r.t. continuous extendability, such as the need for entity resolution and deduplication. By applying a transformation to the knowledge graph, we formulate LLM prompt creation as an unsupervised node classification problem, drawing on ideas from Markov Random Fields. We evaluate our approach on a healthcare use case involving multi-faceted analyses of patient anamneses, given a set of medical concepts as well as clinical literature. Experiments indicate that it can optimize the information density, coverage, and arrangement of LLM prompts while reducing their length, which may lead to lower costs and more consistent and reliable LLM outputs.
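To make the tripartite layout concrete, the following is a minimal sketch, assuming a generic graph library (networkx) and invented node names, edge attributes, and scores; it is not the authors' implementation, and the Markov-Random-Field-based prompt construction is omitted.

```python
# Illustrative sketch (not the paper's code) of the tripartite knowledge graph:
# three disjoint node layers -- domain objects, ontology concepts, and sections
# within text chunks -- connected object->concept and concept->chunk-section.
import networkx as nx

G = nx.Graph()

# Layer 1: complex, domain-specific objects (e.g. a patient anamnesis)
G.add_node("patient_001", layer="object")

# Layer 2: curated ontology concepts
for concept in ["hypertension", "beta_blocker"]:
    G.add_node(concept, layer="concept")

# Layer 3: chunk sections identified by a concept-anchored pre-analysis
G.add_node("chunk_42:sec_3", layer="chunk_section",
           text="... history of elevated blood pressure ...")

# Object -> concept edges come from the ontology / anamnesis analysis
G.add_edge("patient_001", "hypertension")

# Concept -> chunk-section edges carry an embedding similarity that can later
# be turned into a concept-specific relevance estimate
G.add_edge("hypertension", "chunk_42:sec_3", similarity=0.83)

# Prompt material is then gathered by following object -> concept -> section
# paths and ranking sections by their per-concept relevance scores.
candidates = [
    (c, s, G.edges[c, s]["similarity"])
    for c in G.neighbors("patient_001")
    for s in G.neighbors(c)
    if G.nodes[s]["layer"] == "chunk_section"
]
print(sorted(candidates, key=lambda t: -t[2]))
```

Because new documents only ever attach chunk sections to the fixed, curated concept layer in this layout, extending the graph plausibly needs no entity resolution or deduplication, in line with point iii) of the abstract.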
Abstract: In this paper we give instructions on how to write a minimalist grammar (MG). In order to present the instructions as an algorithm, we use a variant of context-free grammars (CFG) as an input format. We can exclude overgeneration if the CFG has no recursion, i.e., no non-terminal can (indirectly) derive a right-hand side containing itself. The constructed MGs utilize licensors/-ees as a special way of handling exceptions. A CFG format for a derivation $A\_eats\_B \mapsto^* peter\_eats\_apples$, where $A$ and $B$ generate noun phrases, normally leads to overgeneration, e.\,g., $i\_eats\_apples$. In order to avoid overgeneration, a CFG would need many non-terminal symbols and rules that mainly produce the same word, just to handle exceptions. In our MGs, however, we can summarize CFG rules that produce the same word in a single item and handle exceptions by a proper distribution of licensees/-ors. The difficulty with this technique is that in most generations the majority of licensees/-ors is not needed, but still has to be triggered somehow. We solve this problem with $\epsilon$-items called \emph{adapters}.
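As a toy illustration of the overgeneration problem described above (a sketch with assumed rules and words, not taken from the paper), a small non-recursive CFG for $A\_eats\_B$ derives every combination of noun phrases, including unwanted ones:

```python
# Illustrative only: a toy CFG with noun-phrase non-terminals A and B
# overgenerates strings such as "i eats apples".
from itertools import product

# Toy rules: S -> A eats B, A -> peter | i, B -> apples | peter
cfg = {
    "S": [["A", "eats", "B"]],
    "A": [["peter"], ["i"]],
    "B": [["apples"], ["peter"]],
}

def expand(symbol):
    """Return all terminal strings derivable from a symbol (the CFG has no recursion)."""
    if symbol not in cfg:          # terminal symbol
        return [symbol]
    results = []
    for rhs in cfg[symbol]:
        for combo in product(*(expand(s) for s in rhs)):
            results.append(" ".join(combo))
    return results

print(expand("S"))
# ['peter eats apples', 'peter eats peter', 'i eats apples', 'i eats peter']
# Only 'peter eats apples' is intended; the rest is overgeneration. Avoiding it
# inside a CFG requires extra non-terminals and near-duplicate rules, whereas
# the paper's MGs keep one item per word and block unwanted derivations via
# licensee/licensor features, triggered through epsilon "adapter" items.
```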