Gouri Ginde

Department of Electrical and Software Engineering, University of Calgary, Calgary, Canada

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

May 22, 2026

Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde

Abstract: Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
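
The Macro-F1 figures quoted in this abstract are unweighted averages of per-class F1 scores. As a minimal sketch of how such a score is computed (this is not the authors' evaluation code, and the function name `macro_f1` and the example labels are hypothetical):

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); a class with no counts scores 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical report-level labels, for illustration only.
truth = ["clean", "clean", "hallucinated", "hallucinated"]
preds = ["clean", "hallucinated", "hallucinated", "hallucinated"]
score = macro_f1(truth, preds)
```

Because every class contributes equally, a rare hallucination type that is detected poorly lowers Macro-F1 as much as a common one would.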

Do Masked Autoencoders Improve Downhole Prediction? An Empirical Study on Real Well Drilling Data

Apr 21, 2026

Aleksander Berezowski, Hassan Hassanzadeh, Gouri Ginde

Abstract: Downhole drilling telemetry presents a fundamental labeling asymmetry: surface sensor data are generated continuously at 1 Hz, while labeled downhole measurements are costly, intermittent, and scarce. Current machine learning approaches for downhole metric prediction universally adopt fully supervised training from scratch, which is poorly suited to this data regime. We present the first empirical evaluation of masked autoencoder (MAE) pretraining for downhole drilling metric prediction. Using two publicly available Utah FORGE geothermal wells comprising approximately 3.5 million timesteps of multivariate drilling telemetry, we conduct a systematic full-factorial design space search across 72 MAE configurations and compare them against supervised LSTM and GRU baselines on the task of predicting Total Mud Volume. Results show that the best MAE configuration reduces test mean absolute error by 19.8% relative to the supervised GRU baseline, while trailing the supervised LSTM baseline by 6.4%. Analysis of design dimensions reveals that latent space width is the dominant architectural choice (Pearson r = -0.59 with test MAE), while masking ratio has negligible effect, an unexpected finding attributed to high temporal redundancy in 1 Hz drilling data. These results establish MAE pretraining as a viable paradigm for drilling analytics and identify the conditions under which it is most beneficial.
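
Masked autoencoder pretraining hides part of the input and trains a model to reconstruct it, which is why it needs no downhole labels. The abstract does not state how the masking is applied, so the sketch below assumes whole timesteps are hidden at random and that the loss is computed only on hidden positions; `mask_timesteps` and `masked_mse` are hypothetical names, not the paper's code.

```python
import random


def mask_timesteps(window, mask_ratio, mask_value=0.0, seed=None):
    """Hide a random fraction of timesteps in a multivariate window.

    `window` is a list of rows, one row of sensor readings per timestep.
    Returns the masked copy and a per-timestep boolean mask.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_masked = round(len(window) * mask_ratio)
    hidden = set(rng.sample(range(len(window)), n_masked))
    masked = [
        [mask_value] * len(row) if i in hidden else list(row)
        for i, row in enumerate(window)
    ]
    is_masked = [i in hidden for i in range(len(window))]
    return masked, is_masked


def masked_mse(reconstruction, target, is_masked):
    """Mean squared error over hidden timesteps only."""
    errors = [
        (r - t) ** 2
        for rec_row, tgt_row, m in zip(reconstruction, target, is_masked)
        if m
        for r, t in zip(rec_row, tgt_row)
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical 10-step window with two surface channels.
window = [[float(i), 2.0 * i] for i in range(10)]
masked, is_masked = mask_timesteps(window, mask_ratio=0.5, seed=0)
```

With highly redundant 1 Hz data, a hidden timestep can often be recovered from its neighbours, which is consistent with the abstract's observation that the masking ratio had little effect.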

Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data

Apr 16, 2026

Aleksander Berezowski, Hassan Hassanzadeh, Gouri Ginde

Abstract: Oil and gas drilling operations generate extensive time-series data from surface sensors, yet accurate real-time prediction of critical downhole metrics remains challenging due to the scarcity of labelled downhole measurements. This systematic mapping study reviews thirteen papers published between 2015 and 2025 to assess the potential of Masked Autoencoder Foundation Models (MAEFMs) for predicting downhole metrics from surface drilling data. The review identifies eight commonly collected surface metrics and seven target downhole metrics. Current approaches predominantly employ neural network architectures such as artificial neural networks (ANNs) and long short-term memory (LSTM) networks, yet no studies have explored MAEFMs despite their demonstrated effectiveness in time-series modeling. MAEFMs offer distinct advantages through self-supervised pre-training on abundant unlabeled data, enabling multi-task prediction and improved generalization across wells. This research establishes that MAEFMs represent a technically feasible but unexplored opportunity for drilling analytics, recommending future empirical validation of their performance against existing models and exploration of their broader applicability in oil and gas operations.

Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation

Apr 26, 2025

Jagrit Acharya, Gouri Ginde

Abstract: Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we explore whether instruction fine-tuned Large Language Models (LLMs) can automatically transform casual, unstructured bug reports into high-quality, structured bug reports adhering to a standard template. We evaluate three open-source instruction-tuned LLMs (Qwen 2.5, Mistral, and Llama 3.2) against ChatGPT-4o, measuring performance on established metrics such as CTQRS, ROUGE, METEOR, and SBERT. Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of 77%, outperforming fine-tuned Mistral (71%), Llama 3.2 (63%), and ChatGPT in 3-shot learning (75%). Further analysis reveals that Llama 3.2 shows higher accuracy in detecting missing fields, particularly Expected Behavior and Actual Behavior, while Qwen 2.5 demonstrates superior performance in capturing Steps-to-Reproduce, with an F1 score of 76%. Additional testing of the models on other popular projects (e.g., Eclipse, GCC) demonstrates that our approach generalizes well, achieving up to 70% CTQRS on bug reports from unseen projects. These findings highlight the potential of instruction fine-tuning in automating structured bug report generation, reducing manual effort for developers and streamlining the software maintenance process.

Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews

Nov 11, 2024

Aakash Sorathiya, Gouri Ginde

Abstract: With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, which makes automated extraction of ethical concern-related app reviews challenging and time-consuming. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews into privacy concerns; and 3) uses the best NLI and LLM models to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and the Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Using NLI+LLM, an additional 1,008 privacy-related reviews were then extracted that were not identified through the keyword-based approach in previous research, demonstrating the effectiveness of the proposed approach.

Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems

Sep 02, 2024

Farzaneh Dehghani, Mahsa Dibaji, Fahim Anzum, Lily Dey, Alican Basdemir, Sayeh Bayat, Jean-Christophe Boucher, Steve Drew, Sarah Elaine Eaton, Richard Frayne (+16 more)

Abstract: Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which, if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black box nature presents significant ethical challenges related to bias and transparency. AI applications are hugely impacted by biases, presenting inconsistent and unreliable findings, leading to significant costs and consequences, highlighting and perpetuating inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team of researchers working with Trustworthy and Responsible AI, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.

* 44 pages, 2 figures

PRAGyan -- Connecting the Dots in Tweets

Jul 18, 2024

Rahul Ravi, Gouri Ginde, Jon Rokne

Abstract: As social media platforms grow, understanding the underlying reasons behind events and statements becomes crucial for businesses, policymakers, and researchers. This research explores the integration of Knowledge Graphs (KGs) with Large Language Models (LLMs) to perform causal analysis of a tweet dataset. LLM-aided analysis techniques often lack depth in uncovering the causes driving observed effects. By leveraging KGs and LLMs, which encode rich semantic relationships and temporal information, this study aims to uncover the complex interplay of factors influencing causal dynamics and compare the results with those obtained using GPT-3.5 Turbo. We employ a Retrieval-Augmented Generation (RAG) model, utilizing a KG stored in a Neo4j (a.k.a. PRAGyan) data format, to retrieve relevant context for causal reasoning. Our approach demonstrates that the KG-enhanced LLM RAG can provide improved results when compared to the baseline LLM (GPT-3.5 Turbo) model as the source corpus increases in size. Our qualitative analysis highlights the advantages of combining KGs with LLMs for improved interpretability and actionable insights, facilitating informed decision-making across various domains. Quantitative analysis using metrics such as BLEU and cosine similarity shows that our approach outperforms the baseline by 10%.

* 9 pages, ASONAM
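
The PRAGyan abstract reports cosine similarity as one of its quantitative metrics but does not say which text representation is compared. As a minimal sketch of the metric itself, the example below assumes plain word-count vectors as a stand-in representation; the function name `cosine_similarity` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words shared by both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical generated answer and reference answer.
generated = "the outage caused the complaints"
reference = "the complaints were caused by the outage"
similarity = cosine_similarity(generated, reference)
```

A score near 1 means the two texts use nearly the same words in similar proportions; an embedding-based variant would replace the word counts with dense vectors and keep the same formula.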
