Vinay Kumar Sankarapu

Interpretability as Alignment: Making Internal Understanding a Design Principle

Sep 10, 2025

Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu

Abstract: Large neural models are increasingly deployed in high-stakes settings, raising concerns about whether their behavior reliably aligns with human values. Interpretability provides a route to internal transparency by revealing the computations that drive outputs. We argue that interpretability, especially mechanistic approaches, should be treated as a design principle for alignment, not an auxiliary diagnostic tool. Post-hoc methods such as LIME or SHAP offer intuitive but correlational explanations, while mechanistic techniques like circuit tracing or activation patching yield causal insight into internal failures, including deceptive or misaligned reasoning that behavioral methods like RLHF, red teaming, or Constitutional AI may overlook. Despite these advantages, interpretability faces challenges of scalability, epistemic uncertainty, and mismatches between learned representations and human concepts. Our position is that progress on safe and trustworthy AI will depend on making interpretability a first-class objective of AI research and development, ensuring that systems are not only effective but also auditable, transparent, and aligned with human intent.
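
The abstract contrasts correlational explanations with causal techniques such as activation patching. As a rough illustration only (it is not taken from the paper), the sketch below patches one hidden activation at a time from a "clean" run into a "corrupted" run of a tiny hand-built network and records how much the output moves. The network, its weights `W1` and `W2`, and the helper `forward` are all hypothetical and chosen purely for readability.

```python
# Hypothetical toy network: 2 inputs -> 2 ReLU hidden units -> 1 linear output.
# The weights are made up for illustration; they do not come from any paper.
W1 = [[1.0, 0.0], [0.0, 1.0]]  # hidden unit i reads only input i
W2 = [2.0, -1.0]               # output weights on the two hidden units


def forward(x, patch=None):
    """Run the toy network, optionally overwriting one hidden activation.

    `patch` is a (hidden_index, value) pair applied after the ReLU.
    Returns (output, list_of_hidden_activations).
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch is not None:
        idx, value = patch
        hidden[idx] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


clean_x = [1.0, 1.0]    # input on which the behavior of interest occurs
corrupt_x = [0.0, 1.0]  # same input with the first feature removed

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch each hidden unit's clean activation into the corrupted run and
# measure how far the output moves back toward the clean output.
effects = []
for i in range(len(clean_hidden)):
    patched_out, _ = forward(corrupt_x, patch=(i, clean_hidden[i]))
    effects.append(patched_out - corrupt_out)

print(effects)
```

In this toy setup only the first hidden unit carries the difference between the two inputs, so patching it restores the clean output while patching the second unit changes nothing. That intervention-based reading is what separates this kind of analysis from a purely correlational attribution.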

* Pre-Print

Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance

Feb 07, 2025

Pratinav Seth, Vinay Kumar Sankarapu

Abstract: This position paper emphasizes the critical gap in the evaluation of Explainable AI (XAI) due to the lack of standardized and reliable metrics, which diminishes its practical value, trustworthiness, and ability to meet regulatory requirements. Current evaluation methods are often fragmented, subjective, and biased, making them prone to manipulation and complicating the assessment of complex models. A central issue is the absence of a ground truth for explanations, complicating comparisons across various XAI approaches. To address these challenges, we advocate for widespread research into developing robust, context-sensitive evaluation metrics. These metrics should be resistant to manipulation, relevant to each use case, and based on human judgment and real-world applicability. We also recommend creating domain-specific evaluation benchmarks that align with the user and regulatory needs of sectors such as healthcare and finance. By encouraging collaboration among academia, industry, and regulators, we can create standards that balance flexibility and consistency, ensuring XAI explanations are meaningful, trustworthy, and compliant with evolving regulations.

xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods

Feb 05, 2025

Pratinav Seth, Yashwardhan Rathore, Neeraj Kumar Singh, Chintan Chitroda, Vinay Kumar Sankarapu

Abstract: The growing complexity of machine learning and deep learning models has led to an increased reliance on opaque "black box" systems, making it difficult to understand the rationale behind predictions. This lack of transparency is particularly challenging in high-stakes applications where interpretability is as important as accuracy. Post-hoc explanation methods are commonly used to interpret these models, but they are seldom rigorously evaluated, raising concerns about their reliability. The Python package xai_evals addresses this by providing a comprehensive framework for generating, benchmarking, and evaluating explanation methods across both tabular and image data modalities. It integrates popular techniques like SHAP, LIME, Grad-CAM, Integrated Gradients (IG), and Backtrace, while supporting evaluation metrics such as faithfulness, sensitivity, and robustness. xai_evals enhances the interpretability of machine learning models, fostering transparency and trust in AI systems. The library is open-sourced at https://pypi.org/project/xai-evals/ .
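
The abstract names faithfulness as one of the supported evaluation metrics. The sketch below shows one common way such a metric can be computed, a deletion-style check, in plain Python. It does not use or reproduce the xai_evals API; the `predict` model, the function names, and the example attributions are all hypothetical and exist only to illustrate the idea.

```python
# A generic deletion-style faithfulness check (NOT the xai_evals API).
# Idea: remove features in order of claimed importance; a faithful
# explanation should make the model's score drop quickly.

WEIGHTS = [3.0, 1.0, 0.0]  # hypothetical linear model, for illustration only


def predict(x):
    """Toy model: a weighted sum of the input features."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def deletion_curve(predict_fn, x, attributions, baseline=0.0):
    """Scores after successively replacing the top-attributed features."""
    order = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)
    current = list(x)
    scores = [predict_fn(current)]
    for i in order:
        current[i] = baseline  # "delete" the feature by setting it to baseline
        scores.append(predict_fn(current))
    return scores


def mean_score(curve):
    """Average height of the deletion curve; lower means more faithful."""
    return sum(curve) / len(curve)


x = [1.0, 1.0, 1.0]
good_attr = [3.0, 1.0, 0.0]  # ranks features the way the model uses them
bad_attr = [0.0, 1.0, 3.0]   # ranks the irrelevant feature first

good_curve = deletion_curve(predict, x, good_attr)
bad_curve = deletion_curve(predict, x, bad_attr)

print(good_curve, mean_score(good_curve))
print(bad_curve, mean_score(bad_curve))
```

Under these assumptions the well-ranked attribution yields a lower average curve than the reversed one. Note that the choice of baseline value (here `0.0`) is itself a design decision that can change the result.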

DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models

Nov 19, 2024

Vinay Kumar Sankarapu, Chintan Chitroda, Yashwardhan Rathore, Neeraj Kumar Singh, Pratinav Seth

Abstract: The rapid advancement of artificial intelligence has led to increasingly sophisticated deep learning models, which frequently operate as opaque 'black boxes' with limited transparency in their decision-making processes. This lack of interpretability presents considerable challenges, especially in high-stakes applications where understanding the rationale behind a model's outputs is as essential as the outputs themselves. This study addresses the pressing need for interpretability in AI systems, emphasizing its role in fostering trust, ensuring accountability, and promoting responsible deployment in mission-critical fields. To address the interpretability challenge in deep learning, we introduce DLBacktrace, an innovative technique developed by the AryaXAI team to illuminate model decisions across a wide array of domains, including simple Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Large Language Models (LLMs), Computer Vision Models, and more. We provide a comprehensive overview of the DLBacktrace algorithm and present benchmarking results, comparing its performance against established interpretability methods, such as SHAP, LIME, GradCAM, Integrated Gradients, SmoothGrad, and Attention Rollout, using diverse task-based metrics. The proposed DLBacktrace technique is compatible with various model architectures built in PyTorch and TensorFlow, supporting models like Llama 3.2, other NLP architectures such as BERT and LSTMs, computer vision models like ResNet and U-Net, as well as custom deep neural network (DNN) models for tabular data. This flexibility underscores DLBacktrace's adaptability and effectiveness in enhancing model transparency across a broad spectrum of applications. The library is open-sourced and available at https://github.com/AryaXAI/DLBacktrace .
