Daniel Maninger

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Nov 06, 2025

Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini

Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.

* To be published in the Proceedings of the 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track

Towards Trustworthy AI Software Development Assistance

Dec 14, 2023

Daniel Maninger, Krishna Narasimhan, Mira Mezini

Abstract: It is expected that in the near future, AI software development assistants will play an important role in the software industry. However, current software development assistants tend to be unreliable, often producing incorrect, unsafe, or low-quality code. We seek to resolve these issues by introducing a holistic architecture for constructing, training, and using trustworthy AI software development assistants. In the center of the architecture, there is a foundational LLM trained on datasets representative of real-world coding scenarios and complex software architectures, and fine-tuned on code quality criteria beyond correctness. The LLM will make use of graph-based code representations for advanced semantic comprehension. We envision a knowledge graph integrated into the system to provide up-to-date background knowledge and to enable the assistant to provide appropriate explanations. Finally, a modular framework for constrained decoding will ensure that certain guarantees (e.g., for correctness and security) hold for the generated code.

* 6 pages, 1 figure; to be published in ICSE-NIER '24: Proceedings of the 46th International Conference on Software Engineering: New Ideas and Emerging Results
