Suresh Kothari

Unlocking LLM Code Correction with Iterative Feedback Loops

Jun 16, 2026

Le Zhang, Suresh Kothari

Abstract: Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs' ability to rectify their own code through execution feedback. Using real-world programming problems across four models and two major programming languages, this study evaluates performance with an iterative refinement framework in which LLMs receive compiler error messages and test-case feedback after each attempt. This study introduces metrics to evaluate code failures, analyzes rectification patterns, and compares the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models consistently improve over iterations, substantially outperforming non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.

* 22 pages, 14th Computing Conference 2026
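
The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the names `Model`, `run_candidate`, `refine`, and `stub_model` are hypothetical, the "model" is a stand-in function rather than a real LLM call, and the sketch runs Python candidates only, whereas the abstract does not name its two languages. It assumes each test case is a pair of standard input and expected standard output.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a model maps (problem, previous feedback) to source code.
Model = Callable[[str, Optional[str]], str]
TestCase = Tuple[str, str]  # (stdin, expected stdout)


def run_candidate(source: str, tests: List[TestCase], timeout: float = 5.0) -> Optional[str]:
    """Run the candidate on each test; return None on success, else a feedback string."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        for stdin, expected in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, path],
                    input=stdin, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return f"Timed out on input {stdin!r}"
            if proc.returncode != 0:
                # Syntax and runtime errors both surface on stderr here.
                return "Execution error:\n" + proc.stderr.strip()
            if proc.stdout.strip() != expected.strip():
                return (f"Wrong answer on input {stdin!r}: "
                        f"expected {expected.strip()!r}, got {proc.stdout.strip()!r}")
        return None
    finally:
        os.unlink(path)


def refine(model: Model, problem: str, tests: List[TestCase],
           max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Ask the model for code, feed failures back, and stop at the first passing attempt."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = model(problem, feedback)
        feedback = run_candidate(source, tests)
        if feedback is None:
            return source, attempt
    return None, max_attempts


def stub_model(problem: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: first answer has a syntax error, the retry fixes it."""
    if feedback is None:
        return "print(int(input()) * 2"
    return "print(int(input()) * 2)"
```

Calling `refine(stub_model, "Double the input integer.", [("3\n", "6")])` exercises one failed attempt followed by a corrected one. A real setup would replace `stub_model` with a call to an actual model and would need sandboxing before executing generated code.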


Holistic Evaluation of State-of-the-Art LLMs for Code Generation

Dec 19, 2025

Le Zhang, Suresh Kothari

Abstract: This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the others in correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.

* 13 pages, 9 figures, 6 tables
