Sofiane Batata

STCALIR: Semi-Synthetic Test Collection for Algerian Legal Information Retrieval

Apr 01, 2026

M'hamed Amine Hatem, Sofiane Batata, Amine Mammasse, Faiçal Azouaou

Abstract: Test collections are essential for evaluating retrieval and re-ranking models. However, constructing such collections is challenging due to the high cost of manual annotation, particularly in specialized domains like Algerian legal texts, where high-quality corpora and relevance judgments are scarce. To address this limitation, we propose STCALIR, a framework for generating semi-synthetic test collections directly from raw legal documents. The pipeline follows the Cranfield paradigm, maintaining its core components of topics, corpus, and relevance judgments, while significantly reducing manual effort through automated multi-stage retrieval and filtering, achieving a 99% reduction in annotation workload. We validate STCALIR using the Mr. TyDi benchmark, demonstrating that the resulting semi-synthetic relevance judgments yield retrieval effectiveness comparable to human-annotated evaluations (Hit@10 ≈ 0.785). Furthermore, system-level rankings derived from these labels exhibit strong concordance with human-based evaluations, as measured by Kendall's τ (0.89) and Spearman's ρ (0.92). Overall, STCALIR offers a reproducible and cost-efficient solution for constructing reliable test collections in low-resource legal domains.
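The metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy scores, and the choice of tau-a and a tie-free Spearman formula are assumptions made only for illustration. Hit@k checks whether any relevant document appears in the top k results; Kendall's τ and Spearman's ρ compare how two sets of labels order the same retrieval systems.

```python
from itertools import combinations


def hit_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any of the top-k retrieved documents is relevant, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over the same systems (assumes no tie correction is needed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def spearman_rho(scores_a, scores_b):
    """Spearman's rho via the rank-difference formula (valid only without ties)."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        result = [0] * len(scores)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(scores_a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical effectiveness scores for four systems under each label set.
human_scores = [0.80, 0.75, 0.60, 0.55]
synthetic_scores = [0.78, 0.70, 0.50, 0.58]

print(hit_at_k(["d3", "d7", "d1"], {"d1"}, k=3))
print(kendall_tau(human_scores, synthetic_scores))
print(spearman_rho(human_scores, synthetic_scores))
```

In this toy case only the two weakest systems swap places, so both coefficients stay high but below 1. A library routine with tie handling would be preferable for real evaluations.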

A Survey of Plagiarism Detection Systems: Case of Use with English, French and Arabic Languages

Jan 10, 2022

Mehdi Abdelhamid, Faical Azouaou, Sofiane Batata

Abstract: In academia, plagiarism is certainly not an emerging concern, but it has grown in magnitude with the popularisation of the Internet and the ease of access to a worldwide source of content, rendering human-only intervention insufficient. Despite that, plagiarism is far from being an unaddressed problem, as computer-assisted plagiarism detection is currently an active area of research that falls within the fields of Information Retrieval (IR) and Natural Language Processing (NLP). Many software solutions have emerged to help fulfil this task, and this paper presents an overview of plagiarism detection systems for use in Arabic, French, and English academic and educational settings. The comparison covered eight systems and was performed with respect to their features, usability, and technical aspects, as well as their performance in detecting three levels of obfuscation from different sources: verbatim, paraphrase, and cross-language plagiarism. An in-depth examination of technical forms of plagiarism was also performed in the context of this study. In addition, a survey of plagiarism typologies and classifications proposed by different authors is provided.

* 26 pages, 2 figures, 19 tables
