Mark Coates

McGill University, Montreal, Canada

Simplifying Graph Transformers

Apr 17, 2025

Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

Abstract: Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness for graph isomorphism testing.
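To make the idea of distance-based attention concrete, here is a minimal sketch in which the attention logits are negative squared L2 distances between queries and keys. The abstract does not give the exact formula, so the scaling by sqrt(dim) and the function name l2_attention are assumptions for illustration, not the authors' definition.

```python
import math

def l2_attention(queries, keys, values):
    """Attention with logits -||q - k||^2 / sqrt(dim) (illustrative form only)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Keys closer to the query (in direction AND magnitude) get larger logits.
        logits = [-sum((qi - ki) ** 2 for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# A zero query attends almost entirely to the zero key rather than the far key.
out = l2_attention([[0.0, 0.0]], [[0.0, 0.0], [10.0, 10.0]], [[1.0], [0.0]])
```

With a zero query, plain dot-product logits would be identical for both keys, whereas the distance-based logits separate them by magnitude.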

Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

Feb 27, 2025

Pavel Rumiantsev, Mark Coates

Abstract: Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the initialization of the evaluated architecture's weights or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost the performance of a search on standard benchmark search spaces.
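The contrast between averaging and treating scores as random variables can be sketched as follows. This is one simple way to compare two score samples, using the empirical probability that one exceeds the other; the abstract does not say which ordering or statistical test the authors use, so prob_greater and the 0.5 threshold are assumptions.

```python
def prob_greater(scores_a, scores_b):
    """Empirical P(A > B) over all pairs of evaluations, counting ties as half."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))

def stochastically_better(scores_a, scores_b, threshold=0.5):
    """Prefer A when it beats B in more than `threshold` of the pairings."""
    return prob_greater(scores_a, scores_b) > threshold

# Repeated ranking-function outputs for two hypothetical architectures.
arch_a = [1.0, 1.1, 0.9, 1.0]
arch_b = [0.5, 0.6, 0.4, 5.0]  # one outlier inflates the mean

mean_a = sum(arch_a) / len(arch_a)   # 1.0
mean_b = sum(arch_b) / len(arch_b)   # 1.625
p_a_wins = prob_greater(arch_a, arch_b)  # 0.75
```

In this toy data, averaging prefers the second architecture because of a single outlier, while the pairwise comparison prefers the first.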

InnerThoughts: Disentangling Representations and Predictions in Large Language Models

Jan 29, 2025

Didier Chételat, Joseph Cotnareanu, Rylee Thompson, Yingxue Zhang, Mark Coates

Abstract: Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying representations of the problem within their hidden states. Ultimately, however, only the hidden state corresponding to the final layer and token position is used to predict the answer label. In this work, we propose instead to learn a small separate neural network predictor module on a collection of training questions, that takes the hidden states from all the layers at the last temporal position as input and outputs predictions. In effect, such a framework disentangles the representational abilities of LLMs from their predictive abilities. On a collection of hard benchmarks, our method achieves considerable improvements in performance, sometimes comparable to supervised fine-tuning procedures, but at a fraction of the computational cost.
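The data flow described above (hidden states from every layer at the last position feeding a separate predictor) might look like the sketch below. The predictor here is a single linear layer with softmax and hand-set weights, purely as a stand-in; the abstract does not describe the actual predictor architecture or its training, and all names are hypothetical.

```python
import math

def gather_last_token_features(hidden_states):
    """hidden_states[layer][position] is a vector; keep the last position of every layer."""
    features = []
    for layer_states in hidden_states:
        features.extend(layer_states[-1])
    return features

def predict_label(features, weights, biases):
    """Stand-in predictor: one linear layer followed by a softmax over answer labels."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Toy model: 2 layers, 2 token positions, hidden size 2.
hidden_states = [
    [[0.0, 0.0], [1.0, 2.0]],  # layer 0
    [[0.0, 0.0], [3.0, 4.0]],  # layer 1
]
features = gather_last_token_features(hidden_states)  # [1.0, 2.0, 3.0, 4.0]
label, probs = predict_label(features, [[1, 0, 0, 0], [0, 0, 0, 1]], [0.0, 0.0])
```

In practice the weights would be learned on training questions while the LLM itself stays frozen, which is where the cost saving relative to fine-tuning would come from.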

* Accepted at AISTATS 2025

Secure Federated Graph-Filtering for Recommender Systems

Jan 28, 2025

Julien Nicolas, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates

Abstract: Recommender systems often rely on graph-based filters, such as normalized item-item adjacency matrices and low-pass filters. While effective, the centralized computation of these components raises concerns about privacy, security, and the ethical use of user data. This work proposes two decentralized frameworks for securely computing these critical graph components without centralizing sensitive information. The first approach leverages lightweight Multi-Party Computation and distributed singular vector computations to privately compute key graph filters. The second extends this framework by incorporating low-rank approximations, enabling a trade-off between communication efficiency and predictive performance. Empirical evaluations on benchmark datasets demonstrate that the proposed methods achieve comparable accuracy to centralized state-of-the-art systems while ensuring data confidentiality and maintaining low communication costs. Our results highlight the potential for privacy-preserving decentralized architectures to bridge the gap between utility and user data protection in modern recommender systems.

Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

Dec 23, 2024

Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao

Abstract: Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework designed to tackle relational reasoning by decomposing the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a task-agnostic graph that identifies crucial entities, relations, and attributes within the problem context. Subsequently, PoT identifies relevant reasoning chains within the graph corresponding to the posed question, facilitating inference of potential answers. Experimental evaluations on four benchmark datasets that demand long reasoning chains demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (maximum 21.3%) without necessitating fine-tuning or extensive LLM calls. Furthermore, as opposed to prior neuro-symbolic methods, PoT exhibits improved resilience against LLM errors by leveraging the compositional nature of graphs.
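The path identification stage can be illustrated with a plain breadth-first search over relation triples. In the paper the graph is extracted by an LLM; here the triples are written by hand, and find_relation_path is a hypothetical helper rather than the authors' implementation.

```python
from collections import deque

def find_relation_path(triples, source, target):
    """Return the shortest chain of relations from source to target, or None."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [relation]))
    return None

# Hand-written kinship graph standing in for the extracted one.
triples = [
    ("Alice", "mother_of", "Bob"),
    ("Bob", "father_of", "Carol"),
    ("Alice", "sister_of", "Dana"),
]
chain = find_relation_path(triples, "Alice", "Carol")
# chain == ["mother_of", "father_of"]
```

The resulting chain is what a downstream reasoning step would compose into an answer such as "grandmother".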

Hint Marginalization for Improved Reasoning in Large Language Models

Dec 17, 2024

Soumyasundar Pal, Didier Chételat, Yingxue Zhang, Mark Coates

Abstract: Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient use of the LLM responses. We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, that is, the most likely answer. Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.
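A rough sketch of estimating an answer distribution from samples and refining it by marginalizing over hints is given below. The update rule p_next(y) = sum over h of p(h) * P(y | hint h) is an assumption about what "marginalization" means here, and the conditional table is invented toy data; in a real system those probabilities would come from LLM responses.

```python
from collections import Counter

def empirical_distribution(samples):
    """Monte Carlo estimate of the answer distribution from sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}

def marginalize_over_hints(current, answer_given_hint):
    """One refinement step: p_next(y) = sum_h current(h) * P(y | hint h)."""
    updated = {}
    for hint, p_hint in current.items():
        for answer, p_answer in answer_given_hint[hint].items():
            updated[answer] = updated.get(answer, 0.0) + p_hint * p_answer
    return updated

def mode(distribution):
    """Most likely answer under the current estimate."""
    return max(distribution, key=distribution.get)

# Initial samples are tied between "12" and "15".
initial = empirical_distribution(["12", "15", "12", "15", "9"])

# Toy conditional answer distributions when a previous answer is given as a hint.
answer_given_hint = {
    "12": {"12": 0.9, "15": 0.1},
    "15": {"12": 0.5, "15": 0.5},
    "9": {"12": 0.6, "9": 0.4},
}
refined = marginalize_over_hints(initial, answer_given_hint)
best = mode(refined)  # "12"
```

In this toy case a plain majority vote over the initial samples is ambiguous, while one refinement step concentrates probability mass on a single answer.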

Differentially private and decentralized randomized power method

Nov 04, 2024

Julien Nicolas, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates

Abstract: The randomized power method has gained significant interest due to its simplicity and efficient handling of large-scale spectral analysis and recommendation tasks. As modern datasets contain sensitive private information, we need to give formal guarantees on the possible privacy leaks caused by this method. This paper focuses on enhancing privacy-preserving variants of the method. We propose a strategy to reduce the variance of the noise introduced to achieve Differential Privacy (DP). We also adapt the method to a decentralized framework with a low computational and communication overhead, while preserving the accuracy. We leverage Secure Aggregation (a form of Multi-Party Computation) to allow the algorithm to perform computations using data distributed among multiple users or devices, without revealing individual data. We show that it is possible to use a noise scale in the decentralized setting that is similar to the one in the centralized setting. We improve upon existing convergence bounds for both the centralized and decentralized versions. The proposed method is especially relevant for decentralized applications such as distributed recommender systems, where privacy concerns are paramount.
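For readers unfamiliar with the underlying iteration, the following is a simplified single-vector power iteration with Gaussian noise added at each step. It is only a sketch: the randomized power method normally tracks several vectors at once, and the noise_scale used here is an arbitrary placeholder that is not calibrated to any differential privacy guarantee.

```python
import math
import random

def noisy_power_iteration(matrix, steps=50, noise_scale=0.0, seed=0):
    """Estimate the leading eigenvector of a square matrix, perturbing each step."""
    rng = random.Random(seed)
    n = len(matrix)

    # Random Gaussian start, normalized to unit length.
    vector = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in vector))
    vector = [x / norm for x in vector]

    for _ in range(steps):
        # Matrix-vector product plus per-coordinate Gaussian noise (placeholder scale).
        product = [sum(row[j] * vector[j] for j in range(n)) + rng.gauss(0.0, noise_scale)
                   for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector

# Leading eigenvector of diag(2, 1) is (+/-1, 0).
clean = noisy_power_iteration([[2.0, 0.0], [0.0, 1.0]], steps=60)
noisy = noisy_power_iteration([[2.0, 0.0], [0.0, 1.0]], steps=60, noise_scale=0.01)
```

Comparing the clean and noisy outputs gives a feel for the accuracy cost of the injected noise, which is the quantity the variance-reduction strategy in the abstract aims to shrink.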

Enhancing CTR Prediction in Recommendation Domain with Search Query Representation

Oct 28, 2024

Yuening Wang, Man Chen, Yaochen Hu, Wei Guo, Yingxue Zhang, Huifeng Guo, Yong Liu, Mark Coates

Abstract: Many platforms, such as e-commerce websites, offer both search and recommendation services simultaneously to better meet users' diverse needs. Recommendation services suggest items based on user preferences, while search services allow users to search for items before providing recommendations. Since users and items are often shared between the search and recommendation domains, there is a valuable opportunity to enhance the recommendation domain by leveraging user preferences extracted from the search domain. Existing approaches either overlook the shift in user intention between these domains or fail to capture the significant impact of learning from users' search queries on understanding their interests. In this paper, we propose a framework that learns from user search query embeddings within the context of user preferences in the recommendation domain. Specifically, user search query sequences from the search domain are used to predict the items users will click at the next time point in the recommendation domain. Additionally, the relationship between queries and items is explored through contrastive learning. To address issues of data sparsity, a diffusion model is incorporated to infer, in a denoising manner, the positive items the user will select after searching with certain queries, which is particularly effective in preventing false positives. With this information effectively extracted, the queries are integrated into click-through rate prediction in the recommendation domain. Experimental analysis demonstrates that our model outperforms state-of-the-art models in the recommendation domain.

* CIKM (2024) 2462-2471
* Accepted by CIKM 2024 Full Research Track

Dynamic layer selection in decoder-only transformers

Oct 26, 2024

Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates

Abstract: The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample at hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using only 23.3% of its layers on average.
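The difference between the two dynamic inference methods can be shown with a toy stack of layers. The layers below are simple arithmetic functions standing in for transformer blocks, and the function names are hypothetical; the sketch only illustrates which layers run under each scheme.

```python
def run_with_layer_mask(layers, hidden, keep):
    """Layer skipping: apply only the layers flagged True; skipped layers act as identity."""
    for layer, use_layer in zip(layers, keep):
        if use_layer:
            hidden = layer(hidden)
    return hidden

def run_with_early_exit(layers, hidden, exit_after):
    """Early exit: apply the first `exit_after` layers and drop all later ones."""
    keep = [index < exit_after for index in range(len(layers))]
    return run_with_layer_mask(layers, hidden, keep)

def fraction_of_layers_used(keep):
    """Share of the stack that was actually executed."""
    return sum(keep) / len(keep)

# Toy "layers" standing in for transformer blocks.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3, lambda h: h * 10]

full = run_with_layer_mask(layers, 1, [True, True, True, True])       # 10
skipped = run_with_layer_mask(layers, 1, [True, False, True, True])   # -10
exited = run_with_early_exit(layers, 1, exit_after=2)                 # 4
```

Early exit is the special case of a keep-mask that is a contiguous prefix, whereas layer skipping may drop any subset of layers, including ones in the middle of the stack.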

Sparse Decomposition of Graph Neural Networks

Oct 25, 2024

Yaochen Hu, Mai Zeng, Ge Zhang, Pavel Rumiantsev, Liheng Ma, Yingxue Zhang, Mark Coates

Abstract: Graph Neural Networks (GNNs) exhibit superior performance in graph representation learning, but their inference cost can be high, due to an aggregation operation that can require a memory fetch for a very large number of nodes. This inference cost is the major obstacle to deploying GNN models with online prediction to reflect the potentially dynamic node features. To address this, we propose an approach to reduce the number of nodes that are included during aggregation. We achieve this through a sparse decomposition, learning to approximate node representations using a weighted sum of linearly transformed features of a carefully selected subset of nodes within the extended neighbourhood. The approach achieves linear complexity with respect to the average node degree and the number of layers in the graph neural network. We introduce an algorithm to compute the optimal parameters for the sparse decomposition, ensuring an accurate approximation of the original GNN model, and present effective strategies to reduce the training time and improve the learning process. We demonstrate via extensive experiments that our method outperforms other baselines designed for inference speedup, achieving significant accuracy gains with comparable inference times for both node classification and spatio-temporal forecasting tasks.
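The inference-time computation described above, a weighted sum of linearly transformed features over a small support set, can be sketched as follows. How the support set and weights are selected and learned is the paper's contribution and is not reproduced here; the support sets, weights, and transform matrix below are hand-picked toy values, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def sparse_node_representation(node, support, weights, transform, features):
    """Approximate h_node as sum over u in support[node] of weights[(node, u)] * (W x_u)."""
    output = [0.0] * len(transform)
    for neighbour in support[node]:
        transformed = matvec(transform, features[neighbour])
        weight = weights[(node, neighbour)]
        output = [o + weight * t for o, t in zip(output, transformed)]
    return output

# Toy graph data: three nodes with 2-dimensional features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
transform = [[1.0, 2.0], [0.0, 1.0]]        # shared linear map W
support = {0: [0, 2]}                       # node 0 only fetches nodes 0 and 2
weights = {(0, 0): 0.5, (0, 2): 0.25}

h0 = sparse_node_representation(0, support, weights, transform, features)
# h0 == [1.25, 0.25]
```

The cost per node scales with the size of its support set rather than with the full multi-hop neighbourhood, which is the source of the inference saving.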
