Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shuangfei Zhai

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Mar 11, 2023

Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind

Abstract:Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as $\textit{entropy collapse}$. As a remedy, we propose $\sigma$Reparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound of the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across Transformer architectures. We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization or adaptive optimizers; (b) deep architectures in machine translation and (c) speech recognition to competitive performance without warmup and adaptive optimizers.

Via

Access Paper or Ask Questions

TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

Mar 07, 2023

David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbott, Eric Gu

Figure 1 for TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

Figure 2 for TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

Figure 3 for TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

Figure 4 for TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

Abstract:Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single step diffusion,TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally we tease apart the method through extended ablations. The PyTorch implementation will be released soon.

Via

Access Paper or Ask Questions

f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Oct 10, 2022

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Miguel Angel Bautista, Josh Susskind

Figure 1 for f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Figure 2 for f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Figure 3 for f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Figure 4 for f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

Abstract:Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances. Unlike VAEs, this formulation limits DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation and derive the corresponding de-noising objective with a modified sampling algorithm. As a demonstration, we apply f-DM in image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoder of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM can produce high-quality samples on standard image generation benchmarks like FFHQ, AFHQ, LSUN, and ImageNet with better efficiency and semantic interpretation.

* 28 pages, 21 figures, work in progress

Via

Access Paper or Ask Questions

GAUDI: A Neural Architect for Immersive 3D Scene Generation

Jul 27, 2022

Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht(+2 more)

Figure 1 for GAUDI: A Neural Architect for Immersive 3D Scene Generation

Figure 2 for GAUDI: A Neural Architect for Immersive 3D Scene Generation

Figure 3 for GAUDI: A Neural Architect for Immersive 3D Scene Generation

Figure 4 for GAUDI: A Neural Architect for Immersive 3D Scene Generation

Abstract:We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.

* Project webpage: https://github.com/apple/ml-gaudi

Via

Access Paper or Ask Questions

Position Prediction as an Effective Pretraining Strategy

Jul 15, 2022

Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind

Figure 1 for Position Prediction as an Effective Pretraining Strategy

Figure 2 for Position Prediction as an Effective Pretraining Strategy

Figure 3 for Position Prediction as an Effective Pretraining Strategy

Figure 4 for Position Prediction as an Effective Pretraining Strategy

Abstract:Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders which rely on reconstructing masked inputs, directly, or contrastively from unmasked content. This pretraining strategy which has been used in BERT models in NLP, Wav2Vec models in Speech and, recently, in MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding related objectives. In this paper, we propose a novel, but surprisingly simple alternative to content reconstruction~-- that of predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input, from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.

* Accepted to ICML 2022

Via

Access Paper or Ask Questions

The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Jun 13, 2022

Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, Joshua Susskind

Figure 1 for The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Figure 2 for The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Figure 3 for The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Figure 4 for The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

Abstract:The grokking phenomenon as reported by Power et al. ( arXiv:2201.02177 ) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layers weights. We empirically observe that without explicit regularization, Grokking as reported in ( arXiv:2201.02177 ) almost exclusively happens at the onset of Slingshots, and is absent without it. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.

* Removed Tex formatting commands in title Title and Abstract

Via

Access Paper or Ask Questions

Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Feb 04, 2022

Ruixiang Zhang, Shuangfei Zhai, Etai Littwin, Josh Susskind

Figure 1 for Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Figure 2 for Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Figure 3 for Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Figure 4 for Learning Representation from Neural Fisher Kernel with Low-rank Approximation

Abstract:In this paper, we study the representation of neural networks from the view of kernels. We first define the Neural Fisher Kernel (NFK), which is the Fisher Kernel applied to neural networks. We show that NFK can be computed for both supervised and unsupervised learning models, which can serve as a unified tool for representation extraction. Furthermore, we show that practical NFKs exhibit low-rank structures. We then propose an efficient algorithm that computes a low rank approximation of NFK, which scales to large datasets and networks. We show that the low-rank approximation of NFKs derived from unsupervised generative models and supervised learning models gives rise to high-quality compact representations of data, achieving competitive results on a variety of machine learning tasks.

Via

Access Paper or Ask Questions

Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Dec 02, 2021

Nitish Srivastava, Walter Talbott, Martin Bertran Lopez, Shuangfei Zhai, Josh Susskind

Figure 1 for Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Figure 2 for Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Figure 3 for Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Figure 4 for Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Abstract:Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent's latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, and unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model which contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.

* NeurIPS Deep Reinforcement Learning Workshop 2021. Code can be found at https://github.com/apple/ml-core

Via

Access Paper or Ask Questions

Regularized Training of Nearest Neighbor Language Models

Sep 16, 2021

Jean-Francois Ton, Walter Talbott, Shuangfei Zhai, Josh Susskind

Figure 1 for Regularized Training of Nearest Neighbor Language Models

Figure 2 for Regularized Training of Nearest Neighbor Language Models

Figure 3 for Regularized Training of Nearest Neighbor Language Models

Figure 4 for Regularized Training of Nearest Neighbor Language Models

Abstract:Including memory banks in a natural language processing architecture increases model capacity by equipping it with additional data at inference time. In this paper, we build upon $k$NN-LM \citep{khandelwal20generalization}, which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results. We investigate whether we can improve the $k$NN-LM performance by instead training a LM with the knowledge that we will be using a $k$NN post-hoc. We achieved significant improvement using our method on language modeling tasks on \texttt{WIKI-2} and \texttt{WIKI-103}. The main phenomenon that we encounter is that adding a simple L2 regularization on the activations (not weights) of the model, a transformer, improves the post-hoc $k$NN classification performance. We explore some possible reasons for this improvement. In particular, we find that the added L2 regularization seems to improve the performance for high-frequency words without deteriorating the performance for low frequency ones.

Via

Access Paper or Ask Questions

Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Jul 02, 2021

Etai Littwin, Omid Saremi, Shuangfei Zhai, Vimal Thilak, Hanlin Goh, Joshua M. Susskind, Greg Yang

Figure 1 for Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Figure 2 for Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Figure 3 for Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Figure 4 for Implicit Acceleration and Feature Learning in Infinitely Wide Neural Networks with Bottlenecks

Abstract:We analyze the learning dynamics of infinitely wide neural networks with a finite sized bottle-neck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network al-lows data dependent feature learning in its bottle-neck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely in-finite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically.

Via

Access Paper or Ask Questions