
Shengjie Luo

Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers

Feb 03, 2023
Krzysztof Marcin Choromanski, Shanda Li, Valerii Likhosherstov, Kumar Avinava Dubey, Shengjie Luo, Di He, Yiming Yang, Tamas Sarlos, Thomas Weingarten, Adrian Weller

We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for non-geometric data, as well as novel RPEs operating on sequences of tokens embedded in higher-dimensional Euclidean spaces (e.g., point clouds). FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE mask. FLTs also allow for applying certain structural inductive bias techniques to specify masking strategies; for example, they provide a way to learn the so-called local RPEs introduced in this paper, which yield accuracy gains compared with several other linear Transformers for language modeling. We also thoroughly test FLTs on other data modalities and tasks, such as image classification and 3D molecular modeling. For 3D data, FLTs are, to the best of our knowledge, the first Transformer architectures providing RPE-enhanced linear attention.
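As a rough illustration of what "learning the spectral representation of an RPE" can mean in practice, the sketch below parameterizes an RPE function of the relative offset by a small set of trainable Fourier features. All names and the exact parameterization are assumptions for this example, not the FLT implementation.

```python
# Illustrative sketch (not the authors' code): parameterize an RPE function
# r(d) on relative offsets d = i - j by a small learned Fourier spectrum.
# The names (num_freqs, freqs, amps) are made up for this example.
import numpy as np

rng = np.random.default_rng(0)
seq_len, num_freqs = 8, 4

# "Learnable" spectral parameters: frequencies and their (cos, sin) amplitudes.
freqs = rng.normal(size=num_freqs)          # omega_k
amps = rng.normal(size=(num_freqs, 2))      # (a_k, b_k)

def rpe(d):
    """Evaluate the RPE at relative offset d from its Fourier representation."""
    return np.sum(amps[:, 0] * np.cos(freqs * d) + amps[:, 1] * np.sin(freqs * d))

# Dense RPE mask R[i, j] = r(i - j); FLT-style methods avoid materializing this
# n x n matrix inside linear attention, but it is useful to see what is learned.
offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
R = np.vectorize(rpe)(offsets)
print(R.shape)  # (8, 8); R is Toeplitz since each entry depends only on i - j
```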

Rethinking the Expressive Power of GNNs via Graph Biconnectivity

Jan 23, 2023
Bohang Zhang, Shengjie Luo, Liwei Wang, Di He

Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
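To make the claim about cheap computability concrete, the snippet below computes the standard biconnectivity quantities (cut vertices, bridges, biconnected blocks) with networkx, whose DFS-based routines run in linear time. This is only an illustration of the metrics, not the paper's code.

```python
# Sketch of the biconnectivity quantities discussed above, computed with
# networkx. Articulation points (cut vertices) and bridges (cut edges) can
# be found in linear time via a depth-first search.
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])  # a triangle plus a tail

cut_vertices = set(nx.articulation_points(G))   # {2, 3}
cut_edges = set(nx.bridges(G))                  # the edges 2-3 and 3-4
blocks = list(nx.biconnected_components(G))     # vertex-biconnected blocks

print(cut_vertices, cut_edges, blocks)
```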

* ICLR 2023 notable top-5%; 58 pages, 11 figures 

One Transformer Can Understand Both 2D & 3D Molecular Data

Oct 04, 2022
Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

Unlike vision and language data, which usually have a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in 3D space. For molecular representation learning, most previous works designed neural networks for only one particular data format, making the learned models likely to fail on other formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work we develop a novel Transformer-based molecular model called Transformer-M, which can take molecular data in 2D or 3D format as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separate channels to encode 2D and 3D structural information and incorporates them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel is activated and the other is disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.
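A minimal sketch of the two-channel idea described above, assuming one structural bias term per channel added to the attention logits. The function names and the specific encodings are hypothetical, not the actual Transformer-M modules.

```python
# Minimal sketch of the two-channel idea (illustrative only; the names
# encode_2d_channel / encode_3d_channel are hypothetical, not Transformer-M's API).
from typing import Optional
import numpy as np

def encode_2d_channel(edges: np.ndarray, num_atoms: int) -> np.ndarray:
    """Placeholder: bias derived from 2D graph structure (here, raw connectivity)."""
    bias = np.zeros((num_atoms, num_atoms))
    bias[edges[:, 0], edges[:, 1]] = 1.0
    return bias

def encode_3d_channel(coords: np.ndarray) -> np.ndarray:
    """Placeholder: bias derived from 3D geometry (here, negative pairwise distances)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return -np.linalg.norm(diff, axis=-1)  # closer atoms get larger bias

def structural_bias(num_atoms: int,
                    edges: Optional[np.ndarray] = None,
                    coords: Optional[np.ndarray] = None) -> np.ndarray:
    """Activate whichever channel matches the input format; disable the other."""
    bias = np.zeros((num_atoms, num_atoms))
    if edges is not None:
        bias += encode_2d_channel(edges, num_atoms)
    if coords is not None:
        bias += encode_3d_channel(coords)
    return bias  # added to the attention logits alongside the atom features
```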

* Preprint. Work in Progress. Code: https://github.com/lsj2408/Transformer-M 

Your Transformer May Not be as Powerful as You Expect

May 26, 2022
Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, the theoretical understanding of RPE-based Transformers remains largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers with regard to whether the model is capable of approximating any continuous sequence-to-sequence function. One may naturally assume the answer is in the affirmative -- that RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed inside the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome this problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. The corresponding URPE-based Transformers therefore become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications.
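The obstruction named in the abstract is easy to see numerically: whatever RPE bias is added to the logits, the softmax output is a right stochastic matrix, so every row sums to one. The snippet below is a small demonstration of that constraint (not the paper's code). Roughly speaking, URPE attention relaxes it by composing the softmax output with an additional learnable positional term, so the rows are no longer forced to sum to one.

```python
# Numerical illustration of the limitation discussed above: no matter what
# RPE bias is added to the logits, softmax attention yields a right
# stochastic matrix (every row sums to 1).
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))        # query-key scores
rpe_bias = rng.normal(size=(5, 5))      # any relative positional bias

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

attn = softmax(logits + rpe_bias)
print(attn.sum(axis=-1))  # all ones: the row-stochastic constraint persists
```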

* Preprint. Work in Progress 

An Empirical Study of Graphormer on Large-Scale Molecular Modeling Datasets

Mar 14, 2022
Yu Shi, Shuxin Zheng, Guolin Ke, Yifei Shen, Jiacheng You, Jiyan He, Shengjie Luo, Chang Liu, Di He, Tie-Yan Liu

This technical note describes the recent updates to Graphormer, including architecture design modifications and the adaptation to 3D molecular dynamics simulation. Graphormer-V2 attains better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain is consistently obtained on downstream tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Graphormer-V2 achieves a much lower MAE than the vanilla Graphormer on the PCQM4M quantum chemistry dataset used in KDD Cup 2021, where the latter won first place in that competition. Meanwhile, Graphormer-V2 greatly outperforms the competitors in the recent Open Catalyst Challenge, a competition track at a NeurIPS 2021 workshop that aims to model the catalyst-adsorbate reaction system with advanced AI models. All models can be found at https://github.com/Microsoft/Graphormer.

* Wrong dual-submission of arXiv:2203.04810 

Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets

Mar 09, 2022
Yu Shi, Shuxin Zheng, Guolin Ke, Yifei Shen, Jiacheng You, Jiyan He, Shengjie Luo, Chang Liu, Di He, Tie-Yan Liu

This technical note describes the recent updates to Graphormer, including architecture design modifications and the adaptation to 3D molecular dynamics simulation. With these simple modifications, Graphormer attains better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain is consistently obtained on 2D and 3D molecular graph modeling tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer achieves a much lower MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021. Meanwhile, it greatly outperforms the competitors in the recent Open Catalyst Challenge, a competition track at a NeurIPS 2021 workshop that aims to model the catalyst-adsorbate reaction system with advanced AI models. All code can be found at https://github.com/Microsoft/Graphormer.

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Jun 23, 2021
Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

The attention module, which is a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
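A small sketch of the core algebraic trick, assuming nothing beyond NumPy: a Toeplitz matrix-vector product can be computed in $\mathcal{O}(n\log n)$ by embedding the Toeplitz matrix into a circulant one and applying the FFT. This only illustrates the complexity argument; it is not the paper's kernelized-attention implementation.

```python
# Sketch of the core trick: multiplying an n x n Toeplitz matrix by a vector
# in O(n log n) via circulant embedding and FFT (illustrative only).
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Compute T @ x where T[i, j] = c[i - j] for i >= j and r[j - i] for j > i."""
    n = len(x)
    # Embed T into a (2n)-circulant whose first column stacks first_col and
    # the reversed tail of first_row, then use circular convolution via FFT.
    circ_col = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(circ_col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Check against the dense product.
rng = np.random.default_rng(0)
n = 6
c, r = rng.normal(size=n), rng.normal(size=n)
r[0] = c[0]  # first row and first column share the diagonal entry
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.normal(size=n)
print(np.allclose(T @ x, toeplitz_matvec(c, r, x)))  # True
```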

* Preprint. Work in Progress 

First Place Solution of KDD Cup 2021 & OGB Large-Scale Challenge Graph Prediction Track

Jun 20, 2021
Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu, Yuxin Wang, Yanming Shen, Di He

In this technical report, we present our solution to the KDD Cup 2021 OGB Large-Scale Challenge, PCQM4M-LSC track. We adopt Graphormer and ExpC as our basic models. We train each model with 8-fold cross-validation and additionally train two Graphormer models on the union of the training and validation sets with different random seeds. For the final submission, we use a naive ensemble of these 18 models by taking the average of their outputs. Using our method, our team MachineLearning achieved 0.1200 MAE on the test set, winning first place in the KDD Cup graph prediction track.
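A minimal sketch of the final ensembling step, with placeholder prediction arrays standing in for the 18 models' outputs.

```python
# Minimal sketch of the naive ensembling described above: average the
# per-model predictions (the arrays here are placeholders, not real outputs).
import numpy as np

num_models, num_molecules = 18, 5
rng = np.random.default_rng(0)
per_model_preds = rng.normal(loc=5.0, scale=0.1, size=(num_models, num_molecules))

ensemble_pred = per_model_preds.mean(axis=0)  # one HOMO-LUMO gap estimate per molecule
print(ensemble_pred)
```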

Do Transformers Really Perform Bad for Graph Representation?

Jun 17, 2021
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu

The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture and attains excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight for utilizing the Transformer on graphs is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and show that, with our ways of encoding the structural information of graphs, many popular GNN variants can be covered as special cases of Graphormer.
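As a rough sketch of one such structural encoding, the snippet below builds a shortest-path-distance attention bias in the spirit of Graphormer's spatial encoding. It is illustrative only, not the released implementation; the learnable per-distance scalars are replaced by random values.

```python
# Rough sketch of a shortest-path-distance attention bias in the spirit of
# Graphormer's spatial encoding (illustrative; not the released code).
import networkx as nx
import numpy as np

G = nx.cycle_graph(6)
n = G.number_of_nodes()
spd = dict(nx.all_pairs_shortest_path_length(G))

max_dist = 8
# One "learnable" scalar per distance bucket; random values as a stand-in.
rng = np.random.default_rng(0)
bias_table = rng.normal(size=max_dist + 1)

attn_bias = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        d = min(spd[i].get(j, max_dist), max_dist)  # clip unreachable / far pairs
        attn_bias[i, j] = bias_table[d]

# attn_bias is added to the query-key logits before softmax, so every node
# attends over the whole graph (global receptive field) with structure-aware weights.
print(attn_bias.shape)
```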
