Abstract: Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) offer only limited analysis of the relationship between code and comment. We propose CIDRe, a language-agnostic, reference-free quality criterion that combines four synergistic aspects: (1) relevance (semantic alignment between code and comment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate the criterion on a manually annotated dataset. Experiments show that CIDRe outperforms existing metrics, achieving an improvement in cross-entropy evaluation. When CIDRe is used to filter comments, models fine-tuned on the filtered data show statistically significant quality gains in GPT-4o-mini assessments.
Abstract: Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models that generate them perform worse for Russian than for English. To bridge this gap, we present StRuCom, the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets, which distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, and uses automated validation to ensure compliance with Python, Java, JavaScript, C#, and Go standards. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom yields statistically significant improvements in chrF++ and BERTScore over baseline models.
Abstract: Many diffusion and autoregressive models now show impressive results in generating images from text and other input domains. However, these methods are not intended for ultra-high-resolution image synthesis. Vector graphics do not have this limitation, so generating images in this format looks very promising. Instead of generating vector images directly, one can first synthesize a raster image and then apply vectorization. Vectorization is the process of converting a raster image into a similar vector image composed of primitive shapes. Besides being similar to the original, the generated vector image should also contain the minimum number of shapes needed for rendering. In this paper, we focus specifically on machine learning-compatible vectorization methods. We consider the Mang2Vec, Deep Vectorization of Technical Drawings, DiffVG, and LIVE models, and provide a brief overview of existing online methods. We also mention other algorithmic methods as well as the Im2Vec and ClipGEN models, but exclude them from the comparison because either no open implementation is available or the official implementation does not work correctly. Our research shows that, despite the ability to directly specify the number and type of shapes, existing machine learning methods are very slow and do not accurately recreate the original image. We believe that there is no fast, universal, automatic approach and that human control is required for every method.