Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fatou Ndiaye Mbodji

SIEVE: Towards Verifiable Certification for Code-datasets

Oct 02, 2025

Fatou Ndiaye Mbodji, El-hacen Diallo, Jordan Samhi, Kui Liu, Jacques Klein, Tegawendé F. Bissyande

Figure 1 for SIEVE: Towards Verifiable Certification for Code-datasets

Figure 2 for SIEVE: Towards Verifiable Certification for Code-datasets

Figure 3 for SIEVE: Towards Verifiable Certification for Code-datasets

Abstract:Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' inform, but they are neither auditable nor do they offer statistical guarantees, making it difficult to attest to dataset quality. Teams build isolated, ad-hoc cleaning pipelines. This fragments effort and raises cost. We present SIEVE, a community-driven framework. It turns per-property checks into Confidence Cards-machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code-datasets.

* 5

Via

Access Paper or Ask Questions

evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments

Jul 28, 2025

Fatou Ndiaye Mbodji

Abstract:Smart contract comment generation has gained traction as a means to improve code comprehension and maintainability in blockchain systems. However, evaluating the quality of generated comments remains a challenge. Traditional metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while human evaluation is costly and unscalable. In this paper, we present \texttt{evalSmarT}, a modular and extensible framework that leverages large language models (LLMs) as evaluators. The system supports over 400 evaluator configurations by combining approximately 40 LLMs with 10 prompting strategies. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs. Our results show that prompt design significantly impacts alignment with human judgment, and that LLM-based evaluation offers a scalable and semantically rich alternative to existing methods.

* 4 pages, 4 tables

Via

Access Paper or Ask Questions