Nghi D. Q. Bui

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Mar 05, 2026

Nghi D. Q. Bui

Abstract:The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present OPENDEV, an open-source, command-line coding agent engineered specifically for this new paradigm. Effective autonomous assistance requires strict safety controls and highly efficient context management to prevent context bloat and reasoning degradation. OPENDEV overcomes these challenges through a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction that progressively reduces older observations. Furthermore, it employs an automated memory system to accumulate project-specific knowledge across sessions and counteracts instruction fade-out through event-driven system reminders. By enforcing explicit reasoning phases and prioritizing context efficiency, OPENDEV provides a secure, extensible foundation for terminal-first AI assistance, offering a blueprint for robust autonomous software engineering.

* Work in progress, new versions will be updated continuously
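
The abstract describes adaptive context compaction only as progressively reducing older observations. A minimal sketch of one such policy is below. The `Message` type, the function name, and every parameter are hypothetical and are not taken from OPENDEV: here the newest tool observations are kept verbatim, and each older one is truncated to a character budget that shrinks with age.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # e.g. "user", "assistant", or "tool"
    content: str


def compact_history(messages, keep_recent=2, base_budget=200, decay=0.5, min_budget=20):
    """Shrink older tool observations more aggressively than recent ones.

    Hypothetical policy (not OPENDEV's): the newest `keep_recent` tool
    observations stay verbatim; each older one gets a character budget
    that is multiplied by `decay` for every additional step of age.
    """
    tool_indices = [i for i, m in enumerate(messages) if m.role == "tool"]
    compacted = list(messages)  # shallow copy; the input list is left untouched
    # Walk tool observations from newest to oldest so `age` grows with distance.
    for age, idx in enumerate(reversed(tool_indices)):
        if age < keep_recent:
            continue
        budget = max(min_budget, int(base_budget * decay ** (age - keep_recent)))
        text = messages[idx].content
        if len(text) > budget:
            omitted = len(text) - budget
            compacted[idx] = Message("tool", text[:budget] + f" [... {omitted} chars omitted]")
    return compacted
```

A production agent would more likely summarize than truncate, and would count tokens rather than characters; truncation is used here only to keep the sketch short.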

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Dec 23, 2025

Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

Abstract:Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
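
The abstract does not define Fix Rate beyond saying that it captures partial progress. One plausible reading, assumed here purely for illustration and not confirmed by the source, is the fraction of initially failing tests that a patch makes pass, with no credit when the patch breaks a test that passed before. The function and argument names are hypothetical.

```python
def fix_rate(fail_to_pass, pass_to_pass, passed_after_patch):
    """Partial-credit score for one task instance (assumed definition).

    fail_to_pass: tests that fail before the change and should pass after it.
    pass_to_pass: tests that pass before the change and must keep passing.
    passed_after_patch: tests that pass once the agent's patch is applied.
    """
    passed = set(passed_after_patch)
    # Assumption: a patch that breaks existing behaviour earns no credit.
    if any(t not in passed for t in pass_to_pass):
        return 0.0
    # Nothing to fix means there is no partial progress to measure.
    if not fail_to_pass:
        return 0.0
    fixed = sum(1 for t in fail_to_pass if t in passed)
    return fixed / len(fail_to_pass)
```

Under this reading, a patch that fixes two of four target tests without regressions scores 0.5, whereas a binary resolution metric would score it 0.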

On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

May 16, 2025

Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Abstract:Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

* 100 pages
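
To make the two features concrete, the sketch below shows one common form of normalized sigmoid gating combined with a shared expert. It assumes the gate applies a sigmoid to each routing score, keeps the top-k affinities, and renormalizes them to sum to one, while the shared expert is applied to every input. The paper's exact formulation may differ, and all names and the scalar-valued experts are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def normalized_sigmoid_gate(scores, top_k):
    """Per-expert weights: sigmoid affinities, top-k kept, renormalized to sum to 1."""
    affinities = [sigmoid(s) for s in scores]
    # Indices of the k largest affinities (ties keep their original order).
    selected = sorted(range(len(scores)), key=lambda i: affinities[i], reverse=True)[:top_k]
    total = sum(affinities[i] for i in selected)
    weights = [0.0] * len(scores)
    for i in selected:
        weights[i] = affinities[i] / total
    return weights


def moe_output(x, shared_expert, routed_experts, scores, top_k):
    """The shared expert always contributes; routed experts are mixed by the gate."""
    weights = normalized_sigmoid_gate(scores, top_k)
    routed = sum(w * expert(x) for w, expert in zip(weights, routed_experts) if w > 0.0)
    return shared_expert(x) + routed
```

Unlike softmax gating, each sigmoid affinity is computed independently of the other experts' scores; the only coupling between experts comes from the top-k selection and the final normalization.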

SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

Apr 20, 2025

Minh V. T. Pham, Huy N. Phan, Hoang N. Phan, Cuong Le Chi, Tien N. Nguyen, Nghi D. Q. Bui

Abstract:Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces, limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM agents to simulate debugging workflows, producing not only bug-fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench Lite. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.

* Work in progress

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

Aug 06, 2024

Nam Le Hai, Nghi D. Q. Bui

Abstract:Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across multiple languages, but also in achieving robust performance in comment classification by improving comment representation. As a result, it outperforms the STACC baseline by 3% on the NLBSE'24 Tool Competition dataset in terms of average F1-score, while maintaining a comparable inference time for practical use. The source code is publicly available at https://github.com/FSoft-AI4Code/Dopamin.

* Accepted at The 3rd Intl. Workshop on NL-based Software Engineering, 2024

XMainframe: A Large Language Model for Mainframe Modernization

Aug 05, 2024

Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

Abstract:Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Jun 17, 2024

Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

Abstract:The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce REPOEXEC, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. REPOEXEC provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuning models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. REPOEXEC aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

Jun 16, 2024

Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, Nghi D. Q. Bui

Abstract:Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints, so that software is developed incrementally. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing baselines such as ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments. Our source code can be found at https://github.com/FSoft-AI4Code/AgileCoder.
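
The idea of a dependency graph that is refreshed as the codebase changes can be sketched as follows. This is an illustration under stated assumptions, not AgileCoder's Dynamic Code Graph Generator: the class and method names are hypothetical, the graph is kept at module granularity, and edges come only from Python `import` statements parsed with the standard `ast` module.

```python
import ast


class CodeDependencyGraph:
    """Tracks which modules each module imports (hypothetical design)."""

    def __init__(self):
        self.edges = {}  # module name -> set of imported module names

    def update(self, module, source):
        """Re-parse one module and replace its outgoing edges."""
        deps = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        self.edges[module] = deps

    def dependents_of(self, target):
        """Modules that import `target`, i.e. those a change to it may affect."""
        return sorted(m for m, deps in self.edges.items() if target in deps)
```

Calling `update` after each edit keeps the graph current without re-parsing the whole repository, and `dependents_of` gives an agent the set of files worth re-reading before it modifies a module.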

Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals

Mar 21, 2024

Khanh Nghiem, Anh Minh Nguyen, Nghi D. Q. Bui

Abstract:As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.

RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion

Mar 16, 2024

Huy N. Phan, Hoang N. Phan, Tien N. Nguyen, Nghi D. Q. Bui

Abstract:Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages the Expand and Refine retrieval method, including a graph expansion and a link prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.

* Under Review
