Abstract: The rapid advancement of AI-based workflows and methods for software engineering emphasizes the need for systematic evaluation and analysis of their ability to leverage information from entire projects, particularly in large codebases. In this challenge on optimization of context collection for code completion, organized by JetBrains in collaboration with Mistral AI as part of the ASE 2025 conference, participants developed efficient mechanisms for collecting context from source code repositories to improve fill-in-the-middle code completions for Python and Kotlin. We constructed a large dataset of real-world code in these two programming languages from permissively licensed open-source projects. Submissions were evaluated on their ability to maximize completion quality for multiple state-of-the-art neural models, measured with the chrF metric. During the public phase of the competition, nineteen teams submitted solutions to the Python track and eight to the Kotlin track. In the private phase, six teams competed, five of which submitted papers to the workshop.
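The abstract above mentions chrF-based scoring of fill-in-the-middle completions; the sketch below illustrates the general idea of comparing a generated middle span against the held-out ground truth. It is a minimal illustration, not the challenge's actual evaluation harness, and it assumes the sacrebleu package is available.

```python
# Minimal sketch: scoring a fill-in-the-middle completion with chrF,
# assuming the sacrebleu package (not the official challenge harness).
from sacrebleu.metrics import CHRF

chrf = CHRF()  # character n-gram F-score (default char_order=6, beta=2)

def score_completion(generated: str, ground_truth: str) -> float:
    """Return the chrF score of a generated middle span against the reference."""
    return chrf.sentence_score(generated, [ground_truth]).score

# Toy example: compare a model's completion against the held-out middle of a file.
reference = "def area(r):\n    return math.pi * r ** 2\n"
candidate = "def area(r):\n    return 3.14159 * r * r\n"
print(f"chrF = {score_completion(candidate, reference):.2f}")
```

In the challenge setting, such per-example scores would be averaged over the benchmark to rank context-collection strategies by the completion quality they enable.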
Abstract: With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench) towards pairwise-comparison leaderboards such as LMSYS Arena. This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models that make rare but significant errors or produce low-confidence outputs. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), although they require more comparisons to converge when ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
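The Bradley-Terry model referenced above turns pairwise win counts into per-model strength scores. The sketch below shows one common way to fit it, the standard MM (minorization-maximization) update; it is a generic illustration under that assumption, not the paper's released implementation, and the example wins matrix is invented.

```python
# Minimal sketch: fitting Bradley-Terry strengths from pairwise win counts
# via the standard MM update (illustrative; not the paper's code).
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths.

    wins[i, j] = number of comparisons in which model i beat model j.
    Returns a strength vector normalized to sum to 1; higher means stronger.
    Assumes the comparison graph is connected.
    """
    n = wins.shape[0]
    p = np.ones(n) / n
    totals = wins + wins.T          # n_ij: comparisons between models i and j
    win_counts = wins.sum(axis=1)   # W_i: total wins of model i
    for _ in range(iters):
        denom = totals / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = win_counts / denom.sum(axis=1)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Hypothetical example: 3 models, 10 comparisons per pair.
wins = np.array([[0, 8, 6],
                 [2, 0, 5],
                 [4, 5, 0]], dtype=float)
print(bradley_terry(wins))  # model 0 should receive the highest strength
```

Note that ties are not handled here; as the abstract observes, frequent ties slow convergence and call for tie-aware extensions of the model.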