Debayan Banerjee

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

May 19, 2026

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

Abstract:Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.

* Accepted by NeSy 2026
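
The group-relative step that gives GRPO its name can be illustrated briefly: several completions are sampled for the same prompt, each receives a scalar reward, and every reward is normalized against the mean and standard deviation of its own group. The Python sketch below is a minimal illustration under stated assumptions. The function names `outcome_reward` and `group_relative_advantages`, the three reward components, and their weights are hypothetical and are not the reward design reported in the paper.

```python
from statistics import mean, pstdev


def outcome_reward(well_formed, executes, answer_correct):
    """Combine outcome signals into one scalar reward.

    The three signals and their weights are hypothetical placeholders.
    """
    return 0.2 * well_formed + 0.3 * executes + 0.5 * answer_correct


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the statistics of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled queries for one question: two fully correct, two failures.
rewards = [
    outcome_reward(True, True, True),
    outcome_reward(False, False, False),
    outcome_reward(False, False, False),
    outcome_reward(True, True, True),
]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.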

DBLPLink 2.0 -- An Entity Linker for the DBLP Scholarly Knowledge Graph

Jul 30, 2025

Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck

Abstract:In this work we present an entity linker for the 2025 version of DBLP's RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work we develop a zero-shot entity linker based on LLMs with a novel method, in which we re-rank candidate entities based on the log-probabilities of the "yes" token output at the penultimate layer of the LLM.
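
The re-ranking idea in this abstract, ordering candidates by the log-probability assigned to a "yes" token, can be sketched in a few lines of Python. The sketch assumes that one logit vector per candidate has already been obtained from the model. How those logits are read out at the penultimate layer is not described here and is not reproduced. The names `log_softmax_at` and `rerank_by_yes_logprob`, the toy logits, and the entity identifiers are hypothetical.

```python
import math


def log_softmax_at(logits, index):
    """Log-probability of one vocabulary entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def rerank_by_yes_logprob(candidates, yes_index):
    """Sort (entity, logits) pairs by the log-probability of the 'yes' token."""
    scored = [(entity, log_softmax_at(logits, yes_index))
              for entity, logits in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy two-token vocabulary: index 0 = "no", index 1 = "yes" (an assumption).
candidates = [
    ("entity_a", [2.0, 0.5]),
    ("entity_b", [0.1, 3.0]),
]
ranking = rerank_by_yes_logprob(candidates, yes_index=1)
```

In this toy input, `entity_b` ranks first because its logits put more mass on the "yes" index.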

Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset

Dec 05, 2024

Tilahun Abedissa Taffa, Debayan Banerjee, Yaregal Assabie, Ricardo Usbeck

Abstract:Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, necessitating the development of QA systems that integrate information from multiple heterogeneous data sources. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions incorporating both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a RAG-based baseline hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD test set.
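
For readers unfamiliar with the exact match metric quoted in this abstract, the Python sketch below shows one common way to compute it as a percentage. The normalization used here (lowercasing and collapsing whitespace) is an assumption of this sketch. The evaluation script behind the reported score may normalize answers differently.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match_score(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


score = exact_match_score(["Hamburg ", "2019"], ["hamburg", "2020"])
```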

Reporting and Analysing the Environmental Impact of Language Models on the Example of Commonsense Question Answering with External Knowledge

Jul 24, 2024

Aida Usmanova, Junbo Huang, Debayan Banerjee, Ricardo Usbeck

Abstract:Human-produced emissions are growing at an alarming rate, causing already observable changes in the climate and environment in general. Each year global carbon dioxide emissions hit a new record, and it is reported that 0.5% of total US greenhouse gas emissions are attributed to data centres as of 2021. The release of ChatGPT in late 2022 sparked social interest in Large Language Models (LLMs), the new generation of Language Models with a large number of parameters and trained on massive amounts of data. Currently, numerous companies are releasing products featuring various LLMs, with many more models in development and awaiting release. Deep Learning research is a competitive field, with only models that reach top performance attracting attention and being utilized. Hence, achieving better accuracy and results is often the first priority, while the model's efficiency and the environmental impact of the study are neglected. However, LLMs demand substantial computational resources and are very costly to train, both financially and environmentally. It becomes essential to raise awareness and promote conscious decisions about algorithmic and hardware choices. Providing information on training time, the approximate carbon dioxide emissions and power consumption would assist future studies in making necessary adjustments and determining the compatibility of available computational resources with model requirements. In this study, we infused the T5 LLM with external knowledge and fine-tuned the model for a Question-Answering task. Furthermore, we calculated and reported the approximate environmental impact for both steps. The findings demonstrate that smaller models may not always be sustainable options, and increased training does not always imply better performance. The optimal outcome is achieved by carefully considering both performance and efficiency factors.

* Presented at Bonn Sustainable AI 2023 conference

DBLPLink: An Entity Linker for the DBLP Scholarly Knowledge Graph

Sep 25, 2023

Debayan Banerjee, Arefa, Ricardo Usbeck, Chris Biemann

Abstract:In this work, we present a web application named DBLPLink, which performs entity linking over the DBLP scholarly knowledge graph. DBLPLink uses text-to-text pre-trained language models, such as T5, to produce entity label spans from an input text question. Entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them based on entity embeddings, such as TransE, DistMult and ComplEx. The results are displayed so that users may compare and contrast the results between T5-small, T5-base and the different KG embeddings used. The demo can be accessed at https://ltdemos.informatik.uni-hamburg.de/dblplink/.

* Accepted at International Semantic Web Conference (ISWC) 2023 Posters & Demo Track
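
The three KG embedding models named in the abstract above each define a plausibility score for a (head, relation, tail) triple. The Python sketch below gives the standard scoring functions as background. How DBLPLink feeds such embeddings into its re-ranker is not described here, so this is not the system's implementation, and the toy vectors are made up for illustration.

```python
import math


def transe_score(h, r, t):
    """TransE: negative Euclidean distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult: trilinear product of real-valued vectors."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with conjugated tail."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


# Toy vectors: the tail is exactly head + relation, so TransE scores it as 0.
head, relation, tail = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
perfect = transe_score(head, relation, tail)
```

Higher scores mean a more plausible triple in all three models, so candidate entities can be sorted by score in descending order.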

The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing

May 24, 2023

Debayan Banerjee, Pranav Ajit Nair, Ricardo Usbeck, Chris Biemann

Abstract:In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are predominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.

* Accepted as a short paper to ACL 2023 findings
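
The vocabulary substitution described in the abstract above can be pictured as a reversible token mapping: query tokens are replaced by tokenizer-friendly words before training, and the model output is mapped back before execution. The Python sketch below illustrates the mechanism only. The mapping, the example query, and the function names `substitute` and `restore` are hypothetical and are not the substitutions evaluated in the paper.

```python
def substitute(query, mapping):
    """Replace whitespace-separated tokens according to the mapping."""
    return " ".join(mapping.get(token, token) for token in query.split())


def restore(text, mapping):
    """Invert the substitution; requires a one-to-one mapping."""
    inverse = {target: source for source, target in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping must be one-to-one to be reversible")
    return substitute(text, inverse)


# Hypothetical mapping from query syntax to ordinary words.
mapping = {"SELECT": "choose", "WHERE": "given", "{": "open", "}": "close"}
query = "SELECT ?x WHERE { ?x ex:type ex:Person }"

encoded = substitute(query, mapping)
decoded = restore(encoded, mapping)
```

The round trip only works if the replacement words never occur as ordinary tokens in the original queries, which is a constraint any real mapping would have to respect.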

DBLP-QuAD: A Question Answering Dataset over the DBLP Scholarly Knowledge Graph

Mar 29, 2023

Debayan Banerjee, Sushil Awale, Ricardo Usbeck, Chris Biemann

Abstract:In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications published by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. DBLP-QuAD is the largest scholarly question answering dataset.

* 12 pages, CEUR-WS 1-column format, accepted at the International Bibliometric Information Retrieval Workshop @ ECIR 2023

GETT-QA: Graph Embedding based T2T Transformer for Knowledge Graph Question Answering

Mar 28, 2023

Debayan Banerjee, Pranav Ajit Nair, Ricardo Usbeck, Chris Biemann

Abstract:In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.

* 16 pages, single-column format, accepted at the ESWC 2023 research track
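
The disambiguation step described in the GETT-QA abstract above, where a truncated KG embedding helps pick among entities that share a label, can be sketched as a nearest-neighbour search over the leading embedding dimensions. The Python sketch below is an illustration under assumptions. The candidate list, the squared Euclidean distance, and the function name `disambiguate` are hypothetical and may differ from the system's actual grounding procedure.

```python
def disambiguate(predicted_prefix, candidates):
    """Return the candidate whose leading embedding dimensions best match.

    predicted_prefix: the truncated embedding produced by the model.
    candidates: (entity_id, full_embedding) pairs retrieved by label search.
    """
    k = len(predicted_prefix)

    def squared_distance(embedding):
        return sum((p - e) ** 2 for p, e in zip(predicted_prefix, embedding[:k]))

    best_id, _ = min(candidates, key=lambda pair: squared_distance(pair[1]))
    return best_id


# Two hypothetical entities sharing one label, with made-up embeddings.
candidates = [
    ("Q1", [0.1, 0.9, 0.3]),
    ("Q2", [0.8, 0.2, 0.5]),
]
chosen = disambiguate([0.7, 0.25], candidates)
```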

A System for Human-AI collaboration for Online Customer Support

Feb 07, 2023

Debayan Banerjee, Mathis Poser, Christina Wiethof, Varun Shankar Subramanian, Richard Paucar, Eva A. C. Bittner, Chris Biemann

Abstract:AI-enabled chatbots have recently been put to use answering customer service queries; however, users commonly report that bots lack a personal touch and are often unable to understand the real intent of the user's question. To this end, it is desirable to have human involvement in the customer servicing process. In this work, we present a system where a human support agent collaborates in real-time with an AI agent to satisfactorily answer customer queries. We describe the user interaction elements of the solution, along with the machine learning techniques involved in the AI agent.

ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System

Jan 25, 2023

Debayan Banerjee, Seid Muhie Yimam, Sushil Awale, Chris Biemann

Abstract:In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
