Gema Rodríguez-Pérez

More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Jan 29, 2026

Haoming Huang, Pongchai Jaisri, Shota Shimizu, Lingfeng Chen, Sota Nakashima, Gema Rodríguez-Pérez

Abstract: Large Language Model (LLM) agents are advancing quickly and are increasingly used to assist with development tasks such as code generation. While LLM agents accelerate code generation, studies indicate they may also have adverse effects on development. However, existing metrics measure only pass rates; they fail to reflect impacts on long-term maintainability and readability, and fail to capture reviewers' intuitive evaluations of pull requests (PRs). To give a more comprehensive picture of this problem, we investigate the characteristics of LLM-generated pull requests beyond the pass rate. We use code metrics to assess the code quality and maintainability of PRs objectively, and we analyze developers' reactions to both human-written and LLM-generated pull requests. The results indicate that LLM agents frequently disregard code reuse opportunities, resulting in higher levels of redundancy than in code from human developers. In contrast to these quality issues, our emotion analysis reveals that reviewers tend to express more neutral or positive emotions towards AI-generated contributions than towards human ones. This disconnect suggests that the surface-level plausibility of AI code masks redundancy, leading to the silent accumulation of technical debt in real-world development environments. Our research provides insights for improving human-AI collaboration.

* Accepted to MSR 2026

A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks

Nov 23, 2024

Rohit Dandamudi, Gema Rodríguez-Pérez

Abstract: Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges stem primarily from the lack of high-quality benchmarks across programming languages and the imbalanced nature of the CLMs' training corpus. Recent work on code generation, one of the common downstream tasks, has shown promise by introducing translated benchmarks built with different methodologies, but empirical evidence assessing these benchmarks is still lacking. To address this gap, we conducted a preliminary study to evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation. We used two existing state-of-the-art translations of the popular code generation benchmark HumanEval, provided by the OctoPack and MultiPL-E studies. Our results suggest that the outcomes observed on these translated benchmarks align well with evaluation metrics used during the training phase, such as perplexity, which supports their effectiveness in estimating the performance of CLMs. However, we identified several inconsistencies in the CLMs' performance across the translated benchmarks and encountered challenges in replicating the results. These initial insights highlight the need for more comprehensive empirical studies to fully understand the methodological approaches, limitations, and reproducibility of translated benchmarks. Such studies are essential to ensure their reliability before they are widely adopted.

* ASEW 2024: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, Pages 94 - 99
* 5 pages, ASEW 2024
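
The abstract above mentions perplexity as a training-phase metric. As a rough illustration only, perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below assumes that definition and a hypothetical input of per-token natural-log probabilities; it is not taken from the paper's evaluation code.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Assumed definition: exp(-(1/N) * sum(log p_i)).
    `token_log_probs` is a hypothetical input, e.g. the log-probability
    a model assigned to each token of a held-out code snippet.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among 4 options.
uniform_quarter = [math.log(0.25)] * 4
result = perplexity(uniform_quarter)
```

Lower perplexity means the model finds the text less surprising, which is why it is often compared against downstream benchmark scores.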

Investigating the Efficacy of Large Language Models for Code Clone Detection

Jan 30, 2024

Mohamad Khajezade, Jie JW Wu, Fatemeh Hendijani Fard, Gema Rodríguez-Pérez, Mohamed Sami Shehata

Abstract: Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. LLMs are mainly used in the prompt-based zero/few-shot paradigm, in which a prompt guides the model in accomplishing the task. GPT-based models are among the most popular models studied for tasks such as code comment generation or test generation. These are 'generative' tasks. However, there is limited research on using LLMs with the prompt-based paradigm for 'non-generative' tasks such as classification. In this preliminary exploratory study, we investigated the applicability of LLMs to Code Clone Detection (CCD), a non-generative task. After building mono-lingual and cross-lingual CCD datasets derived from CodeNet, we first investigated two different prompts using ChatGPT to detect Type-4 code clones in Java-Java and Java-Ruby pairs in a zero-shot setting. We then conducted an analysis to understand the strengths and weaknesses of ChatGPT in CCD. ChatGPT surpasses the baselines in cross-language CCD, attaining an F1-score of 0.877, and achieves performance comparable to fully fine-tuned models for mono-lingual CCD, with an F1-score of 0.878. In addition, the prompt and the difficulty level of the problems have an impact on the performance of ChatGPT. Finally, we provide insights and future directions based on our initial analysis.
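
The F1-scores reported above are the harmonic mean of precision and recall for the binary clone/non-clone decision. The following is a minimal sketch of that standard computation; the label lists are made-up illustrative data, not the paper's results, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives.

    Labels are assumed to be 1 for "clone pair" and 0 for "not a clone".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall); 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: five code pairs with gold labels and model answers.
gold = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
tp, fp, fn = confusion_counts(gold, predicted)
score = f1_score(tp, fp, fn)
```

Because F1 ignores true negatives, it stays informative when non-clone pairs dominate a dataset, which is one reason it is a common choice for clone detection.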
