Abstract:With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
Abstract:Natural language explanations in recommender systems are often framed as a review generation task, leveraging user reviews as ground-truth supervision. While convenient, this approach conflates a user's opinion with the system's reasoning, leading to explanations that may be fluent but fail to reflect the true logic behind recommendations. In this work, we revisit the core objective of explainable recommendation: to transparently communicate why an item is recommended by linking user needs to relevant item features. Through a comprehensive analysis of existing methods across multiple benchmark datasets, we identify common limitations-explanations that are weakly aligned with model predictions, vague or inaccurate in identifying user intents, and overly repetitive or generic. To overcome these challenges, we propose FIRE, a lightweight and interpretable framework that combines SHAP-based feature attribution with structured, prompt-driven language generation. FIRE produces faithful, diverse, and user-aligned explanations, grounded in the actual decision-making process of the model. Our results demonstrate that FIRE not only achieves competitive recommendation accuracy but also significantly improves explanation quality along critical dimensions such as alignment, structure, and faithfulness. This work highlights the need to move beyond the review-as-explanation paradigm and toward explanation methods that are both accountable and interpretable.