Di He

Boosting the Certified Robustness of L-infinity Distance Nets

Oct 13, 2021

Bohang Zhang, Du Jiang, Di He, Liwei Wang

Abstract: Recently, Zhang et al. (2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified robustness by its construction. Despite its excellent theoretical properties, the model so far can only achieve comparable performance to conventional networks. In this paper, we significantly boost the certified robustness of $\ell_\infty$-distance nets through a careful analysis of their training process. In particular, we show that the $\ell_p$-relaxation, a crucial way to overcome the non-smoothness of the model, leads to an unexpectedly large Lipschitz constant at the early training stage. This makes optimization with the hinge loss insufficient and produces sub-optimal solutions. Given these findings, we propose a simple approach to address the issues above by using a novel objective function that combines a scaled cross-entropy loss with a clipped hinge loss. Our experiments show that using the proposed training strategy, the certified accuracy of the $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), while significantly outperforming other approaches in this area. Such a result clearly demonstrates the effectiveness and potential of the $\ell_\infty$-distance net for certified robustness.
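
The abstract names the objective (a scaled cross-entropy term plus a clipped hinge term) but does not give its formula. The sketch below shows one plausible form under stated assumptions: the hinge is taken on the margin between the true-class score and the best other score, it is clipped at 1, and the names `scale`, `theta`, and `lam` are hypothetical hyperparameters introduced here for illustration. It should not be read as the authors' exact loss.

```python
import math

def scaled_cross_entropy(logits, label, scale):
    """Cross-entropy of softmax(scale * logits) against the true label."""
    z = [scale * v for v in logits]
    m = max(z)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[label]

def clipped_hinge(logits, label, theta):
    """Hinge loss on the classification margin, clipped so it never exceeds 1.

    Assumption: margin = true-class score minus the largest other score,
    and theta is the target margin.
    """
    best_other = max(v for i, v in enumerate(logits) if i != label)
    margin = logits[label] - best_other
    hinge = max(0.0, 1.0 - margin / theta)
    return min(hinge, 1.0)

def combined_loss(logits, label, scale=1.0, theta=1.0, lam=0.1):
    """Assumed combination: lam * scaled CE + clipped hinge."""
    ce = scaled_cross_entropy(logits, label, scale)
    return lam * ce + clipped_hinge(logits, label, theta)

if __name__ == "__main__":
    # Correctly classified with a large margin: only the CE term contributes.
    print(combined_loss([3.0, 0.0], label=0))
    # Misclassified: the hinge term saturates at 1, the CE term still grows.
    print(combined_loss([3.0, 0.0], label=1))
```

With clipping, a misclassified example contributes a constant hinge value, so in this sketch the cross-entropy term is the only one that still distinguishes between badly and very badly misclassified inputs.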

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Jun 23, 2021

Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

Abstract: The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
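
The $\mathcal{O}(n\log n)$ claim rests on a standard fact: a Toeplitz matrix (entries depend only on $i-j$, as with a relative positional bias) can be multiplied by a vector through FFT by embedding it in a larger circulant matrix. The sketch below illustrates only that building block, not the paper's full kernelized attention. The function names and the `col`/`row` representation of the matrix are choices made here for illustration.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    The inverse transform is returned unnormalized (caller divides by n).
    """
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def toeplitz_matvec(col, row, x):
    """Compute T @ x where T[i][j] = col[i-j] if i >= j else row[j-i].

    col is the first column, row is the first row (col[0] == row[0]).
    """
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # circulant size: power of two, at least 2n - 1
        m *= 2
    c = [0.0] * m  # first column of the embedding circulant matrix
    for k in range(n):
        c[k] = col[k]
    for k in range(1, n):
        c[m - k] = row[k]
    x_pad = list(x) + [0.0] * (m - n)
    # Circulant matvec is a circular convolution: pointwise product in Fourier space.
    prod = [a * b for a, b in zip(fft(c), fft(x_pad))]
    y = fft(prod, inverse=True)
    return [y[i].real / m for i in range(n)]

def toeplitz_matvec_naive(col, row, x):
    """Quadratic-time reference used only to check the FFT version."""
    n = len(x)
    return [
        sum((col[i - j] if i >= j else row[j - i]) * x[j] for j in range(n))
        for i in range(n)
    ]

if __name__ == "__main__":
    col, row, x = [1.0, 2.0, 3.0], [1.0, 4.0, 5.0], [1.0, 0.0, -1.0]
    print(toeplitz_matvec(col, row, x))
    print(toeplitz_matvec_naive(col, row, x))
```

The naive product touches all $n^2$ entries, while the FFT route needs three transforms of length at most $4n$, which is where the $n\log n$ cost comes from.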

* Preprint. Work in Progress

First Place Solution of KDD Cup 2021 & OGB Large-Scale Challenge Graph Prediction Track

Jun 20, 2021

Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu, Yuxin Wang, Yanming Shen, Di He

Abstract: In this technical report, we present our solution to the KDD Cup 2021 OGB Large-Scale Challenge - PCQM4M-LSC Track. We adopt Graphormer and ExpC as our basic models. We train each model by 8-fold cross-validation, and additionally train two Graphormer models on the union of training and validation sets with different random seeds. For the final submission, we use a naive ensemble of these 18 models by taking the average of their outputs. Using our method, our team MachineLearning achieved 0.1200 MAE on the test set, which won first place in the KDD Cup graph prediction track.
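
The model count follows from the setup described: 2 architectures × 8 folds + 2 extra Graphormer models = 18. A naive ensemble of this kind can be sketched as below. The prediction values are made-up toy numbers, not results from the report, and the function names are illustrative.

```python
def ensemble_average(predictions_per_model):
    """Average the outputs of several models, sample by sample."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]

def mean_absolute_error(pred, target):
    """MAE, the metric reported for the PCQM4M-LSC track."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

# Model count implied by the abstract: 8 folds for each of the two
# architectures, plus two Graphormer models trained on train + validation.
NUM_MODELS = 2 * 8 + 2

if __name__ == "__main__":
    # Toy predictions from three hypothetical models on two samples.
    toy_predictions = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
    averaged = ensemble_average(toy_predictions)
    print(NUM_MODELS, averaged, mean_absolute_error(averaged, [2.5, 3.0]))
```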

Do Transformers Really Perform Bad for Graph Representation?

Jun 17, 2021

Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu

Abstract: The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture and attains excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight for utilizing the Transformer on graphs is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and show that with our ways of encoding the structural information of graphs, many popular GNN variants can be covered as special cases of Graphormer.

How could Neural Networks understand Programs?

May 31, 2021

Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu

Abstract: Semantic understanding of programs is a fundamental problem for programming language processing (PLP). Recent works that learn representations of code based on pre-training techniques in NLP have pushed the frontiers in this direction. However, the semantics of PL and NL have essential differences. If these differences are ignored, we believe it is difficult to build a model that better understands programs, either by directly applying off-the-shelf NLP pre-training techniques to the source code or by adding features to the model heuristically. In fact, the semantics of a program can be rigorously defined by formal semantics in PL theory. For example, operational semantics describes the meaning of a valid program as updating the environment (i.e., the memory address-value function) through fundamental operations, such as memory I/O and conditional branching. Inspired by this, we propose a novel program semantics learning paradigm in which the model learns from information composed of (1) the representations which align well with the fundamental operations in operational semantics, and (2) the information of environment transition, which is indispensable for program understanding. To validate our proposal, we present a hierarchical Transformer-based pre-training model called OSCAR to better facilitate the understanding of programs. OSCAR learns from intermediate representation (IR) and an encoded representation derived from static analysis, which are used for representing the fundamental operations and approximating the environment transitions respectively. OSCAR empirically shows an outstanding capability for program semantics understanding on many practical software engineering tasks.

* ICML 2021

Adversarial Training with Rectified Rejection

May 31, 2021

Tianyu Pang, Huishuai Zhang, Di He, Yinpeng Dong, Hang Su, Wei Chen, Jun Zhu, Tie-Yan Liu

Abstract: Adversarial training (AT) is one of the most effective strategies for promoting model robustness, yet even the state-of-the-art adversarially trained models struggle to exceed 60% robust test accuracy on CIFAR-10 without additional data, which is far from practical. A natural way to break this accuracy bottleneck is to introduce a rejection option, where confidence is a commonly used certainty proxy. However, the vanilla confidence can overestimate the model certainty if the input is wrongly classified. To this end, we propose to use true confidence (T-Con) (i.e., predicted probability of the true class) as a certainty oracle, and learn to predict T-Con by rectifying confidence. We prove that under mild conditions, a rectified confidence (R-Con) rejector and a confidence rejector can be coupled to distinguish any wrongly classified input from correctly classified ones, even under adaptive attacks. We also quantify that training R-Con to be aligned with T-Con could be an easier task than learning robust classifiers. In our experiments, we evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, and demonstrate that the RR module is compatible with different AT frameworks in improving robustness, with little extra computation.
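
The distinction between vanilla confidence and T-Con can be made concrete with a small sketch. It follows the definitions in the abstract (confidence as the largest predicted probability, T-Con as the predicted probability of the true class). The logits are made-up toy values, and the learned R-Con rectifier itself is not modeled here because the abstract does not specify its form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(probs):
    """Vanilla confidence: probability of the predicted (arg-max) class."""
    return max(probs)

def true_confidence(probs, true_label):
    """T-Con: probability assigned to the true class (needs the label)."""
    return probs[true_label]

if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])  # the model predicts class 0
    # If the true class is 0, both quantities coincide.
    print(confidence(probs), true_confidence(probs, 0))
    # If the true class is 1, confidence stays high while T-Con is low.
    print(confidence(probs), true_confidence(probs, 1))
```

On a misclassified input the two values diverge, which is the overestimation the abstract refers to. Since T-Con requires the label, it is only available as a training target, not at test time.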

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Mar 09, 2021

Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

Abstract: Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction over baseline and a higher codebook utilization in comparison to wav2vec 2.0.

Transformers with Competitive Ensembles of Independent Mechanisms

Feb 27, 2021

Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, Yoshua Bengio

Abstract: An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work, we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.

* Under Review, ICML 2021

LazyFormer: Self Attention with Lazy Update

Feb 25, 2021

Chengxuan Ying, Guolin Ke, Di He, Tie-Yan Liu

Abstract: Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer is composed of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once in the first layer and is then reused in all upper layers. In this way, the computation cost can be substantially reduced. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
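
The reuse pattern described above can be sketched in a few lines. This is a toy, single-head illustration under simplifying assumptions that are not from the paper: queries, keys, and values are the hidden states themselves (no learned projections), there is no feed-forward sublayer, and the helper names are hypothetical. It shows only where the attention distribution is computed and where it is reused.

```python
import math

def attention_probs(h):
    """Softmax(h h^T / sqrt(d)) with identity Q/K projections (assumption)."""
    d = len(h[0])
    probs = []
    for hi in h:
        scores = [sum(a * b for a, b in zip(hi, hj)) / math.sqrt(d) for hj in h]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def apply_probs(probs, h):
    """Mix the hidden states with a given attention distribution."""
    width = len(h[0])
    return [
        [sum(p * hj[c] for p, hj in zip(row, h)) for c in range(width)]
        for row in probs
    ]

def lazy_forward(h, num_blocks, layers_per_block):
    """Run lazy blocks; return final states and the number of attention computations."""
    attention_computations = 0
    for _ in range(num_blocks):
        # Computed once, from the input to the first layer of the block.
        probs = attention_probs(h)
        attention_computations += 1
        for _ in range(layers_per_block):
            # Every layer in the block reuses the same distribution.
            mixed = apply_probs(probs, h)
            h = [[a + b for a, b in zip(hi, mi)] for hi, mi in zip(h, mixed)]
    return h, attention_computations

if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0]]
    out, count = lazy_forward(states, num_blocks=2, layers_per_block=3)
    print(count, "attention computations for", 2 * 3, "layers")
```

In this sketch, six layers need only two attention-distribution computations, whereas a standard stack would compute one per layer.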

Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder

Feb 18, 2021

Shuqi Lu, Chenyan Xiong, Di He, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tieyan Liu, Arnold Overwijk

Abstract: Many real-world applications use Siamese networks to efficiently match text sequences at scale, which requires high-quality sequence encodings. This paper pre-trains language models dedicated to sequence matching in Siamese architectures. We first hypothesize that a representation is better for sequence matching if the entire sequence can be reconstructed from it, which, however, is unlikely to be achieved in standard autoencoders: a strong decoder can rely on its capacity and natural language patterns to reconstruct the sequence and bypass the need for better sequence encodings. Therefore, we propose a new self-learning method that pre-trains the encoder with a weak decoder, which reconstructs the original sequence from the encoder's [CLS] representations but is restricted in both capacity and attention span. In our experiments on web search and recommendation, the pre-trained SEED-Encoder, "SiamEsE oriented encoder by reconstructing from weak decoder", shows significantly better generalization ability when fine-tuned in Siamese networks, improving overall accuracy and few-shot performance. Our code and models will be released.
