Abstract: Evaluating the quality of post-hoc explanations for Graph Neural Networks (GNNs) remains a significant challenge. While recent years have seen a proliferation of explainability methods, current evaluation metrics (e.g., fidelity, sparsity) often fail to assess whether an explanation identifies the true underlying causal variables. To address this, we propose the Explanation-Generalization Score (EGS), a metric that quantifies the causal relevance of GNN explanations. EGS is grounded in the principle of feature invariance: if an explanation captures the true causal drivers, it should lead to stable predictions under distribution shift. To quantify this, we introduce a framework that trains GNNs on explanatory subgraphs and evaluates their performance in out-of-distribution (OOD) settings, using OOD generalization as a rigorous proxy for an explanation's causal validity. A large-scale validation spanning 11,200 model combinations across synthetic and real-world datasets demonstrates that EGS provides a principled benchmark for ranking explainers by their ability to capture causal substructures, offering a robust alternative to traditional fidelity-based metrics.
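A minimal sketch of the EGS evaluation loop in PyTorch may help make the framework concrete. The abstract does not fix an interface, so `explainer`, `make_model`, the graph representation, and the retraining hyperparameters below are all illustrative assumptions, not the paper's implementation.

```python
import torch

def explanation_generalization_score(explainer, train_set, ood_set,
                                      make_model, epochs=50, lr=1e-3):
    """Hedged sketch of EGS: retrain a fresh GNN on explanation
    subgraphs only, then score it by out-of-distribution accuracy.

    Assumed interfaces (not specified in the abstract):
      - explainer(graph) returns the explanatory subgraph in whatever
        format the model consumes;
      - make_model() builds an untrained GNN mapping a graph to class
        logits of shape (num_classes,);
      - train_set / ood_set are iterables of (graph, label) pairs.
    """
    # Step 1: reduce every training graph to its explanatory subgraph.
    masked_train = [(explainer(g), y) for g, y in train_set]

    # Step 2: train a fresh model on the subgraphs alone.
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for g, y in masked_train:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(
                model(g).unsqueeze(0), y.view(1))
            loss.backward()
            opt.step()

    # Step 3: EGS = accuracy on the distribution-shifted test split.
    model.eval()
    with torch.no_grad():
        hits = sum(int(model(g).argmax().item() == y.item())
                   for g, y in ood_set)
    return hits / len(ood_set)
```

The intuition: an explainer that latches onto spurious shortcuts yields subgraphs on which the retrained model collapses under shift (low EGS), while a causally faithful explainer yields subgraphs that generalize (high EGS).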
Abstract: In this paper, we describe our system, submitted under the team name BLEU Monday, for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for the English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies each translation into one of three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions that require visual disambiguation, while IndicTrans2 retranslates cases with purely translation-quality issues. This automated pipeline processes 28,928 training examples across the four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both the original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU gains of +1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90 -> 54.00).
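A minimal sketch of the first stage's routing logic, with the judge and the two correctors wrapped as plain callables. The record schema (`source`, `target`, `image`) and the wrapper interfaces are our assumptions for illustration; the abstract does not specify the paper's actual API.

```python
from enum import Enum
from typing import Callable, Iterable

class Verdict(Enum):
    CORRECT = "correct"
    VISUALLY_AMBIGUOUS = "visually_ambiguous"  # needs image context
    MISTRANSLATED = "mistranslated"            # pure translation-quality issue

def correct_dataset(examples: Iterable[dict],
                    judge: Callable[[str, str, bytes], Verdict],
                    vision_corrector: Callable[[str, bytes], str],
                    mt_corrector: Callable[[str], str]) -> list[dict]:
    """Route each training pair through the judge, then the matching
    corrector. `judge` wraps the multimodal LM, `vision_corrector`
    wraps GPT-4o-mini, and `mt_corrector` wraps IndicTrans2
    (interfaces assumed for illustration)."""
    fixed = []
    for ex in examples:
        verdict = judge(ex["source"], ex["target"], ex["image"])
        if verdict is Verdict.CORRECT:
            fixed.append(ex)  # translation kept as-is
        elif verdict is Verdict.VISUALLY_AMBIGUOUS:
            # Visual ambiguity: regenerate the caption with image context.
            fixed.append({**ex,
                          "target": vision_corrector(ex["source"], ex["image"])})
        else:
            # Pure quality failure: retranslate the English source.
            fixed.append({**ex, "target": mt_corrector(ex["source"])})
    return fixed
```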
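And a hedged sketch of the second stage: attaching LoRA adapters to the distilled IndicTrans2 checkpoint with the `peft` library. The rank, alpha, dropout, and target-module names are illustrative choices, not values reported in the abstract.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

# The distilled 200M en-indic checkpoint ships custom modeling code,
# hence trust_remote_code=True.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

# Illustrative LoRA hyperparameters; the attention-projection module
# names below are an assumption about the checkpoint's architecture.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Fine-tuning then proceeds with a standard seq2seq training loop on the original and corrected datasets in turn, so the two runs differ only in the training data.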