Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

M. Ali Babar

University of Adelaide, CREST - The Centre for Research on Engineering Software Technologies

DDPT: Diffusion-Driven Prompt Tuning for Large Language Model Code Generation

Apr 06, 2025

Jinyang Li, Sangwon Hyun, M. Ali Babar

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation. However, the quality of the generated code is heavily dependent on the structure and composition of the prompts used. Crafting high-quality prompts is a challenging task that requires significant knowledge and skills of prompt engineering. To advance the automation support for the prompt engineering for LLM-based code generation, we propose a novel solution Diffusion-Driven Prompt Tuning (DDPT) that learns how to generate optimal prompt embedding from Gaussian Noise to automate the prompt engineering for code generation. We evaluate the feasibility of diffusion-based optimization and abstract the optimal prompt embedding as a directional vector toward the optimal embedding. We use the code generation loss given by the LLMs to help the diffusion model capture the distribution of optimal prompt embedding during training. The trained diffusion model can build a path from the noise distribution to the optimal distribution at the sampling phrase, the evaluation result demonstrates that DDPT helps improve the prompt optimization for code generation.

* ICSE CAIN 2025

Via

Access Paper or Ask Questions

LLMSecConfig: An LLM-Based Approach for Fixing Software Container Misconfigurations

Feb 04, 2025

Ziyang Ye, Triet Huynh Minh Le, M. Ali Babar

Abstract:Security misconfigurations in Container Orchestrators (COs) can pose serious threats to software systems. While Static Analysis Tools (SATs) can effectively detect these security vulnerabilities, the industry currently lacks automated solutions capable of fixing these misconfigurations. The emergence of Large Language Models (LLMs), with their proven capabilities in code understanding and generation, presents an opportunity to address this limitation. This study introduces LLMSecConfig, an innovative framework that bridges this gap by combining SATs with LLMs. Our approach leverages advanced prompting techniques and Retrieval-Augmented Generation (RAG) to automatically repair security misconfigurations while preserving operational functionality. Evaluation of 1,000 real-world Kubernetes configurations achieved a 94\% success rate while maintaining a low rate of introducing new misconfigurations. Our work makes a promising step towards automated container security management, reducing the manual effort required for configuration maintenance.

Via

Access Paper or Ask Questions

MVD: A Multi-Lingual Software Vulnerability Detection Framework

Dec 09, 2024

Boyu Zhang, Triet H. M. Le, M. Ali Babar

Abstract:Software vulnerabilities can result in catastrophic cyberattacks that increasingly threaten business operations. Consequently, ensuring the safety of software systems has become a paramount concern for both private and public sectors. Recent literature has witnessed increasing exploration of learning-based approaches for software vulnerability detection. However, a key limitation of these techniques is their primary focus on a single programming language, such as C/C++, which poses constraints considering the polyglot nature of modern software projects. Further, there appears to be an oversight in harnessing the synergies of vulnerability knowledge across varied languages, potentially underutilizing the full capabilities of these methods. To address the aforementioned issues, we introduce MVD - an innovative multi-lingual vulnerability detection framework. This framework acquires the ability to detect vulnerabilities across multiple languages by concurrently learning from vulnerability data of various languages, which are curated by our specialized pipeline. We also incorporate incremental learning to enable the detection capability of MVD to be extended to new languages, thus augmenting its practical utility. Extensive experiments on our curated dataset of more than 11K real-world multi-lingual vulnerabilities substantiate that our framework significantly surpasses state-of-the-art methods in multi-lingual vulnerability detection by 83.7% to 193.6% in PR-AUC. The results also demonstrate that MVD detects vulnerabilities well for new languages without compromising the detection performance of previously trained languages, even when training data for the older languages is unavailable. Overall, our findings motivate and pave the way for the prediction of multi-lingual vulnerabilities in modern software systems.

Via

Access Paper or Ask Questions

Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++

Aug 03, 2024

Anh The Nguyen, Triet Huynh Minh Le, M. Ali Babar

Figure 1 for Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++

Figure 2 for Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++

Figure 3 for Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++

Figure 4 for Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++

Abstract:Background: The C and C++ languages hold significant importance in Software Engineering research because of their widespread use in practice. Numerous studies have utilized Machine Learning (ML) and Deep Learning (DL) techniques to detect software vulnerabilities (SVs) in the source code written in these languages. However, the application of these techniques in function-level SV assessment has been largely unexplored. SV assessment is increasingly crucial as it provides detailed information on the exploitability, impacts, and severity of security defects, thereby aiding in their prioritization and remediation. Aims: We conduct the first empirical study to investigate and compare the performance of ML and DL models, many of which have been used for SV detection, for function-level SV assessment in C/C++. Method: Using 9,993 vulnerable C/C++ functions, we evaluated the performance of six multi-class ML models and five multi-class DL models for the SV assessment at the function level based on the Common Vulnerability Scoring System (CVSS). We further explore multi-task learning, which can leverage common vulnerable code to predict all SV assessment outputs simultaneously in a single model, and compare the effectiveness and efficiency of this model type with those of the original multi-class models. Results: We show that ML has matching or even better performance compared to the multi-class DL models for function-level SV assessment with significantly less training time. Employing multi-task learning allows the DL models to perform significantly better, with an average of 8-22% increase in Matthews Correlation Coefficient (MCC). Conclusions: We distill the practices of using data-driven techniques for function-level SV assessment in C/C++, including the use of multi-task DL to balance efficiency and effectiveness. This can establish a strong foundation for future work in this area.

* Accepted as a full paper in the technical track at The International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024

Via

Access Paper or Ask Questions

A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality

Aug 01, 2024

M. Mehdi Kholoosi, M. Ali Babar, Roland Croft

Figure 1 for A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality

Figure 2 for A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality

Figure 3 for A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality

Figure 4 for A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality

Abstract:Artificial Intelligence (AI) advancements have enabled the development of Large Language Models (LLMs) that can perform a variety of tasks with remarkable semantic understanding and accuracy. ChatGPT is one such LLM that has gained significant attention due to its impressive capabilities for assisting in various knowledge-intensive tasks. Due to the knowledge-intensive nature of engineering secure software, ChatGPT's assistance is expected to be explored for security-related tasks during the development/evolution of software. To gain an understanding of the potential of ChatGPT as an emerging technology for supporting software security, we adopted a two-fold approach. Initially, we performed an empirical study to analyse the perceptions of those who had explored the use of ChatGPT for security tasks and shared their views on Twitter. It was determined that security practitioners view ChatGPT as beneficial for various software security tasks, including vulnerability detection, information retrieval, and penetration testing. Secondly, we designed an experiment aimed at investigating the practicality of this technology when deployed as an oracle in real-world settings. In particular, we focused on vulnerability detection and qualitatively examined ChatGPT outputs for given prompts within this prominent software security task. Based on our analysis, responses from ChatGPT in this task are largely filled with generic security information and may not be appropriate for industry use. To prevent data leakage, we performed this analysis on a vulnerability dataset compiled after the OpenAI data cut-off date from real-world projects covering 40 distinct vulnerability types and 12 programming languages. We assert that the findings from this study would contribute to future research aimed at developing and evaluating LLMs dedicated to software security.

* Accepted for publication at International Conference on Trust, Privacy and Security - 2024

Via

Access Paper or Ask Questions

Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?

Jul 25, 2024

Triet H. M. Le, M. Ali Babar

Abstract:Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile, there are growing efforts in automatic SV labeling at scale. However, the fitness of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects for investigating the auto-labeled counterparts. Results: We discover that 50+% of the auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with the publicly reported ones. Yet, SV prediction models utilizing the noisy auto-labeled SVs can perform up to 22% and 90% better in Matthews Correlation Coefficient and Recall, respectively, than the original models. We also reveal the promises and difficulties of applying noise-reduction methods for automatically addressing the noise in auto-labeled SV data to maximize the data utilization for SV prediction. Conclusions: Our study informs the benefits and challenges of using auto-labeled SVs, paving the way for large-scale SV prediction.

* Accepted as a full paper in the technical track at The International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024

Via

Access Paper or Ask Questions

Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?

Jul 15, 2024

Triet H. M. Le, M. Ali Babar

Abstract:Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from the imbalanced distributions of the CVSS classes, but such data imbalance has been hardly understood and addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without leveraging the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient. We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.

* Accepted as a full paper in the technical track at The International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024

Via

Access Paper or Ask Questions

Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Apr 26, 2024

Triet H. M. Le, M. Ali Babar, Tung Hoang Thai

Figure 1 for Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Figure 2 for Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Figure 3 for Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Figure 4 for Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

Abstract:Background: Software Vulnerability (SV) prediction in emerging languages is increasingly important to ensure software security in modern systems. However, these languages usually have limited SV data for developing high-performing prediction models. Aims: We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model and investigate potential solutions to enhance the performance. Method: We train and test the state-of-the-art model based on CodeBERT with and without data sampling techniques for function-level and line-level SV prediction in three low-resource languages - Kotlin, Swift, and Rust. We also assess the effectiveness of ChatGPT for low-resource SV prediction given its recent success in other domains. Results: Compared to the original work in C/C++ with large data, CodeBERT's performance of function-level and line-level SV prediction significantly declines in low-resource languages, signifying the negative impact of data scarcity. Regarding remediation, data sampling techniques fail to improve CodeBERT; whereas, ChatGPT showcases promising results, substantially enhancing predictive performance by up to 34.4% for the function level and up to 53.5% for the line level. Conclusion: We have highlighted the challenge and made the first promising step for low-resource SV prediction, paving the way for future research in this direction.

* Accepted in the 4th International Workshop on Software Security co-located with the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE) 2024

Via

Access Paper or Ask Questions

Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Jan 20, 2024

Triet H. M. Le, Xiaoning Du, M. Ali Babar

Figure 1 for Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Figure 2 for Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Figure 3 for Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Figure 4 for Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Abstract:Collecting relevant and high-quality data is integral to the development of effective Software Vulnerability (SV) prediction models. Most of the current SV datasets rely on SV-fixing commits to extract vulnerable functions and lines. However, none of these datasets have considered latent SVs existing between the introduction and fix of the collected SVs. There is also little known about the usefulness of these latent SVs for SV prediction. To bridge these gaps, we conduct a large-scale study on the latent vulnerable functions in two commonly used SV datasets and their utilization for function-level and line-level SV predictions. Leveraging the state-of-the-art SZZ algorithm, we identify more than 100k latent vulnerable functions in the studied datasets. We find that these latent functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions, yet they have a noise level of around 6%. Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs. The improvements are up to 24.5% in the performance (F1-Score) of function-level SV predictions and up to 67% in the effectiveness of localizing vulnerable lines. Overall, our study presents the first promising step toward the use of latent SVs to improve the quality of SV datasets and enhance the performance of SV prediction tasks.

* Accepted as a full paper in the technical track at the 21st International Conference on Mining Software Repositories (MSR) 2024

Via

Access Paper or Ask Questions

METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities

Dec 11, 2023

Sangwon Hyun, Mingyu Guo, M. Ali Babar

Abstract:Large-Language Models (LLMs) have shifted the paradigm of natural language data processing. However, their black-boxed and probabilistic characteristics can lead to potential risks in the quality of outputs in diverse LLM applications. Recent studies have tested Quality Attributes (QAs), such as robustness or fairness, of LLMs by generating adversarial input texts. However, existing studies have limited their coverage of QAs and tasks in LLMs and are difficult to extend. Additionally, these studies have only used one evaluation metric, Attack Success Rate (ASR), to assess the effectiveness of their approaches. We propose a MEtamorphic Testing for Analyzing LLMs (METAL) framework to address these issues by applying Metamorphic Testing (MT) techniques. This approach facilitates the systematic testing of LLM qualities by defining Metamorphic Relations (MRs), which serve as modularized evaluation metrics. The METAL framework can automatically generate hundreds of MRs from templates that cover various QAs and tasks. In addition, we introduced novel metrics that integrate the ASR method into the semantic qualities of text to assess the effectiveness of MRs accurately. Through the experiments conducted with three prominent LLMs, we have confirmed that the METAL framework effectively evaluates essential QAs on primary LLM tasks and reveals the quality risks in LLMs. Moreover, the newly proposed metrics can guide the optimal MRs for testing each task and suggest the most effective method for generating MRs.

* Accepted to International Conference on Software Testing, Verification and Validation (ICST) 2024 / Key words: Large-language models, Metamorphic testing, Quality evaluation, Text perturbations

Via

Access Paper or Ask Questions