Faithfully summarizing the knowledge encoded by a deep neural network (DNN) into a few symbolic primitive patterns without losing much information represents a core challenge in explainable AI. To this end, Ren et al. (2023c) have derived a series of theorems to prove that the inference score of a DNN can be explained as a small set of interactions between input variables. However, the lack of generalization power makes it still hard to consider such interactions as faithful primitive patterns encoded by the DNN. Therefore, given different DNNs trained for the same task, we develop a new method to extract interactions that are shared by these DNNs. Experiments show that the extracted interactions can better reflect common knowledge shared by different DNNs.
The AI model has surpassed human players in the game of Go, and it is widely believed that the AI model has encoded new knowledge about the Go game beyond human players. In this way, explaining the knowledge encoded by the AI model and using it to teach human players represent a promising-yet-challenging issue in explainable AI. To this end, mathematical supports are required to ensure that human players can learn accurate and verifiable knowledge, rather than specious intuitive analysis. Thus, in this paper, we extract interaction primitives between stones encoded by the value network for the Go game, so as to enable people to learn from the value network. Experiments show the effectiveness of our method.
This paper aims to develop a new attribution method to explain the conflict between individual variables' attributions and their coalition's attribution from a fully new perspective. First, we find that the Shapley value can be reformulated as the allocation of Harsanyi interactions encoded by the AI model. Second, based the re-alloction of interactions, we extend the Shapley value to the attribution of coalitions. Third we ective. We derive the fundamental mechanism behind the conflict. This conflict come from the interaction containing partial variables in their coalition.
This paper aims to prove the emergence of symbolic concepts in well-trained AI models. We prove that if (1) the high-order derivatives of the model output w.r.t. the input variables are all zero, (2) the AI model can be used on occluded samples and will yield higher confidence when the input sample is less occluded, and (3) the confidence of the AI model does not significantly degrade on occluded samples, then the AI model will encode sparse interactive concepts. Each interactive concept represents an interaction between a specific set of input variables, and has a certain numerical effect on the inference score of the model. Specifically, it is proved that the inference score of the model can always be represented as the sum of the interaction effects of all interactive concepts. In fact, we hope to prove that conditions for the emergence of symbolic concepts are quite common. It means that for most AI models, we can usually use a small number of interactive concepts to mimic the model outputs on any arbitrarily masked samples.
In this technical note, we aim to explain a deep neural network (DNN) by quantifying the encoded interactions between input variables, which reflects the DNN's inference logic. Specifically, we first rethink the definition of interactions, and then formally define faithfulness and conciseness for interaction-based explanation. To this end, we propose two kinds of interactions, i.e., the AND interaction and the OR interaction. For faithfulness, we prove the uniqueness of the AND (OR) interaction in quantifying the effect of the AND (OR) relationship between input variables. Besides, based on AND-OR interactions, we design techniques to boost the conciseness of the explanation, while not hurting the faithfulness. In this way, the inference logic of a DNN can be faithfully and concisely explained by a set of symbolic concepts.
The Shapley value is widely regarded as a trustworthy attribution metric. However, when people use Shapley values to explain the attribution of input variables of a deep neural network (DNN), it usually requires a very high computational cost to approximate relatively accurate Shapley values in real-world applications. Therefore, we propose a novel network architecture, the HarsanyiNet, which makes inferences on the input sample and simultaneously computes the exact Shapley values of the input variables in a single forward propagation. The HarsanyiNet is designed on the theoretical foundation that the Shapley value can be reformulated as the redistribution of Harsanyi interactions encoded by the network.
In this paper, we explain the inference logic of large language models (LLMs) as a set of symbolic concepts. Many recent studies have discovered that traditional DNNs usually encode sparse symbolic concepts. However, because an LLM has much more parameters than traditional DNNs, whether the LLM also encodes sparse symbolic concepts is still an open problem. Therefore, in this paper, we propose to disentangle the inference score of LLMs for dialogue tasks into a small number of symbolic concepts. We verify that we can use those sparse concepts to well estimate all inference scores of the LLM on all arbitrarily masking states of the input sentence. We also evaluate the transferability of concepts encoded by an LLM and verify that symbolic concepts usually exhibit high transferability across similar input sentences. More crucially, those symbolic concepts can be used to explain the exact reasons accountable for the LLM's prediction errors.
Various attribution methods have been developed to explain deep neural networks (DNNs) by inferring the attribution/importance/contribution score of each input variable to the final output. However, existing attribution methods are often built upon different heuristics. There remains a lack of a unified theoretical understanding of why these methods are effective and how they are related. To this end, for the first time, we formulate core mechanisms of fourteen attribution methods, which were designed on different heuristics, into the same mathematical system, i.e., the system of Taylor interactions. Specifically, we prove that attribution scores estimated by fourteen attribution methods can all be reformulated as the weighted sum of two types of effects, i.e., independent effects of each individual input variable and interaction effects between input variables. The essential difference among the fourteen attribution methods mainly lies in the weights of allocating different effects. Based on the above findings, we propose three principles for a fair allocation of effects to evaluate the faithfulness of the fourteen attribution methods.
In this paper, we focus on mean-field variational Bayesian Neural Networks (BNNs) and explore the representation capacity of such BNNs by investigating which types of concepts are less likely to be encoded by the BNN. It has been observed and studied that a relatively small set of interactive concepts usually emerge in the knowledge representation of a sufficiently-trained neural network, and such concepts can faithfully explain the network output. Based on this, our study proves that compared to standard deep neural networks (DNNs), it is less likely for BNNs to encode complex concepts. Experiments verify our theoretical proofs. Note that the tendency to encode less complex concepts does not necessarily imply weak representation power, considering that complex concepts exhibit low generalization power and high adversarial vulnerability.
This paper explains the generalization power of a deep neural network (DNN) from the perspective of interactive concepts. Many recent studies have quantified a clear emergence of interactive concepts encoded by the DNN, which have been observed on different DNNs during the learning process. Therefore, in this paper, we investigate the generalization power of each interactive concept, and we use the generalization power of different interactive concepts to explain the generalization power of the entire DNN. Specifically, we define the complexity of each interactive concept. We find that simple concepts can be better generalized to testing data than complex concepts. The DNN with strong generalization power usually learns simple concepts more quickly and encodes fewer complex concepts. More crucially, we discover the detouring dynamics of learning complex concepts, which explain both the high learning difficulty and the low generalization power of complex concepts.