Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rokas Gipiškis

Vilnius University

Evaluation Cards for XAI Metrics

May 06, 2026

Rokas Gipiškis, Olga Kurasova

Abstract:The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.

* 7 pages. Accepted at the 5th XAI4CV Workshop, CVPR 2026 (non-archival)

Via

Access Paper or Ask Questions

Principles and Guidelines for Randomized Controlled Trials in AI Evaluation

May 03, 2026

Christopher Kelly, Angelica Chowdhury, Alexandra Campili, Bimpe Ayoola, Devin Barbour, Thomas Chen Dawson, Ze Shen Chin, Rokas Gipiškis

Abstract:This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the (Shadish et al., 2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.

* 27 pages, Technical AI Safety Conference

Via

Access Paper or Ask Questions

Deprecating Benchmarks: Criteria and Framework

Jul 08, 2025

Ayrton San Joaquin, Rokas Gipiškis, Leon Staufer, Ariel Gil

Figure 1 for Deprecating Benchmarks: Criteria and Framework

Abstract:As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing different models and measuring their progress in different task-specific domains. However, there is a lack of guidance on when and how benchmarks should be deprecated once they cease to effectively perform their purpose. This risks benchmark scores over-valuing model capabilities, or worse, obscuring capabilities and safety-washing. Based on a review of benchmarking practices, we propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks. Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models, and our recommendations are aimed to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.

* 10 pages, 1 table. Accepted to the ICML 2025 Technical AI Governance Workshop

Via

Access Paper or Ask Questions

PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence

May 14, 2025

Yu Lun Hsu, Yun-Rung Chou, Chiao-Ju Chang, Yu-Cheng Chang, Zer-Wei Lee, Rokas Gipiškis, Rachel Li, Chih-Yuan Shih, Jen-Kuei Peng, Hsien-Liang Huang(+2 more)

Figure 1 for PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence

Figure 2 for PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence

Figure 3 for PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence

Figure 4 for PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence

Abstract:Advance Care Planning (ACP) allows individuals to specify their preferred end-of-life life-sustaining treatments before they become incapacitated by injury or terminal illness (e.g., coma, cancer, dementia). While online ACP offers high accessibility, it lacks key benefits of clinical consultations, including personalized value exploration, immediate clarification of decision consequences. To bridge this gap, we conducted two formative studies: 1) shadowed and interviewed 3 ACP teams consisting of physicians, nurses, and social workers (18 patients total), and 2) interviewed 14 users of ACP websites. Building on these insights, we designed PreCare in collaboration with 6 ACP professionals. PreCare is a website with 3 AI-driven assistants designed to guide users through exploring personal values, gaining ACP knowledge, and supporting informed decision-making. A usability study (n=12) showed that PreCare achieved a System Usability Scale (SUS) rating of excellent. A comparative evaluation (n=12) showed that PreCare's AI assistants significantly improved exploration of personal values, knowledge, and decisional confidence, and was preferred by 92% of participants.

Via

Access Paper or Ask Questions

Existing Industry Practice for the EU AI Act's General-Purpose AI Code of Practice Safety and Security Measures

Apr 21, 2025

Lily Stelling, Mick Yang, Rokas Gipiškis, Leon Staufer, Ze Shen Chin, Siméon Campos, Michael Chen

Figure 1 for Existing Industry Practice for the EU AI Act's General-Purpose AI Code of Practice Safety and Security Measures

Abstract:This report provides a detailed comparison between the measures proposed in the EU AI Act's General-Purpose AI (GPAI) Code of Practice (Third Draft) and current practices adopted by leading AI companies. As the EU moves toward enforcing binding obligations for GPAI model providers, the Code of Practice will be key to bridging legal requirements with concrete technical commitments. Our analysis focuses on the draft's Safety and Security section which is only relevant for the providers of the most advanced models (Commitments II.1-II.16) and excerpts from current public-facing documents quotes that are relevant to each individual measure. We systematically reviewed different document types - including companies' frontier safety frameworks and model cards - from over a dozen companies, including OpenAI, Anthropic, Google DeepMind, Microsoft, Meta, Amazon, and others. This report is not meant to be an indication of legal compliance nor does it take any prescriptive viewpoint about the Code of Practice or companies' policies. Instead, it aims to inform the ongoing dialogue between regulators and GPAI model providers by surfacing evidence of precedent.

* 158 pages

Via

Access Paper or Ask Questions

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Oct 30, 2024

Rokas Gipiškis, Ayrton San Joaquin, Ze Shen Chin, Adrian Regenfuß, Ariel Gil, Koen Holtman

Figure 1 for Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Figure 2 for Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Figure 3 for Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Figure 4 for Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Abstract:There is an urgent need to identify both short and long-term risks from newly emerging types of Artificial Intelligence (AI), as well as available risk management measures. In response, and to support global efforts in regulating AI and writing safety standards, we compile an extensive catalog of risk sources and risk management measures for general-purpose AI (GPAI) systems, complete with descriptions and supporting examples where relevant. This work involves identifying technical, operational, and societal risks across model development, training, and deployment stages, as well as surveying established and experimental methods for managing these risks. To the best of our knowledge, this paper is the first of its kind to provide extensive documentation of both GPAI risk sources and risk management measures that are descriptive, self-contained and neutral with respect to any existing regulatory framework. This work intends to help AI providers, standards experts, researchers, policymakers, and regulators in identifying and mitigating systemic risks from GPAI systems. For this reason, the catalog is released under a public domain license for ease of direct use by stakeholders in AI governance and standards.

* 91 pages, 8 figures

Via

Access Paper or Ask Questions

Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

May 02, 2024

Rokas Gipiškis, Chun-Wei Tsai, Olga Kurasova

Figure 1 for Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

Figure 2 for Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

Figure 3 for Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

Figure 4 for Explainable AI (XAI) in Image Segmentation in Medicine, Industry, and Beyond: A Survey

Abstract:Artificial Intelligence (XAI) has found numerous applications in computer vision. While image classification-based explainability techniques have garnered significant attention, their counterparts in semantic segmentation have been relatively neglected. Given the prevalent use of image segmentation, ranging from medical to industrial deployments, these techniques warrant a systematic look. In this paper, we present the first comprehensive survey on XAI in semantic image segmentation. This work focuses on techniques that were either specifically introduced for dense prediction tasks or were extended for them by modifying existing methods in classification. We analyze and categorize the literature based on application categories and domains, as well as the evaluation metrics and datasets used. We also propose a taxonomy for interpretable semantic segmentation, and discuss potential challenges and future research directions.

* 35 pages, 9 figures, 2 tables

Via

Access Paper or Ask Questions