Abstract: Grokking refers to delayed generalization, in which the increase in the test accuracy of a neural network occurs appreciably after the improvement in the training accuracy. This paper introduces several practical metrics, including variance under dropout, robustness, embedding similarity, and sparsity measures, that can forecast grokking behavior. Specifically, the resilience of a neural network to noise during inference is estimated from a Dropout Robustness Curve (DRC), obtained from the variation of the accuracy with the dropout rate as the model transitions from memorization to generalization. The variance of the test accuracy under stochastic dropout across training checkpoints further exhibits a local maximum during the grokking transition. Additionally, the percentage of inactive neurons decreases during generalization, while the embeddings tend toward a bimodal distribution, independent of initialization, that correlates with the observed cosine similarity patterns and dataset symmetries. These metrics additionally provide valuable insight into the origin and behavior of grokking.
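As a rough illustration of the dropout-based metrics described above, the sketch below sweeps the inference-time dropout rate and records the mean test accuracy and its variance over repeated stochastic passes. It assumes a PyTorch model exposing hypothetical `hidden` and `head` submodules; the paper's exact DRC and variance definitions may differ.

```python
import torch
import torch.nn.functional as F

def dropout_robustness_curve(model, loader, rates, n_samples=10, device="cpu"):
    """Sketch of a Dropout Robustness Curve: test accuracy vs. inference-time dropout rate.

    Assumes the model exposes `hidden` (feature extractor) and `head` (classifier)
    submodules; adapt these hypothetical names to the actual architecture.
    """
    model.eval()
    curve = []
    for p in rates:
        accs = []
        for _ in range(n_samples):                      # repeated stochastic passes
            correct, total = 0, 0
            with torch.no_grad():
                for x, y in loader:
                    x, y = x.to(device), y.to(device)
                    h = model.hidden(x)                  # hypothetical hidden-layer call
                    h = F.dropout(h, p=p, training=True) # keep dropout stochastic at inference
                    logits = model.head(h)               # hypothetical output head
                    correct += (logits.argmax(-1) == y).sum().item()
                    total += y.numel()
            accs.append(correct / total)
        accs = torch.tensor(accs)
        # mean accuracy traces out the DRC; the variance is the dropout-variance metric
        curve.append((p, accs.mean().item(), accs.var().item()))
    return curve
```

Tracking the variance entry of this curve across training checkpoints is one way to look for the local maximum mentioned in the abstract.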
Abstract: This paper demonstrates that grokking behavior in a neural network trained on modular arithmetic with modulus P can be controlled by modifying the profile of the activation function as well as the depth and width of the model. Plotting the even PCA projections of the weights of the last network layer against their odd projections further yields patterns that become significantly more uniform when the nonlinearity is increased by incrementing the number of layers. These patterns can be employed to factor P when P is nonprime. Finally, a metric for the generalization ability of the network is inferred from the entropy of the layer weights, while the degree of nonlinearity is related to correlations between the local entropies of the weights of the neurons in the final layer.
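The following minimal sketch shows one plausible reading of the quantities named above: even- versus odd-indexed PCA projections of the last-layer weight matrix, and a histogram-based entropy of the weights. Both interpretations are assumptions made for illustration; the paper's precise definitions may differ.

```python
import numpy as np

def even_odd_pca_projections(last_layer_weights, k=8):
    """Project last-layer weight vectors onto the first k principal components and
    split the resulting coordinates into even- and odd-indexed projections for plotting.
    (Even/odd indexing of components is an assumption, not the paper's stated recipe.)"""
    W = last_layer_weights - last_layer_weights.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(W, full_matrices=False)   # principal directions of the centred weights
    proj = W @ Vt[:k].T                                # coordinates along the first k components
    return proj[:, 0::2], proj[:, 1::2]

def weight_entropy(W, bins=64):
    """Shannon entropy of the empirical weight distribution, used here as a rough
    proxy for the entropy-based generalization metric mentioned in the abstract."""
    hist, _ = np.histogram(np.ravel(W), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```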
Abstract: This paper introduces a modified variational autoencoder (VAE) that contains an additional neural network branch. The resulting branched VAE (BVAE) contributes a classification component based on the class labels to the total loss and therefore imparts categorical information to the latent representation. As a result, the latent space distributions of the input classes are separated and ordered, thereby enhancing the classification accuracy. The degree of improvement is quantified by numerical calculations employing the benchmark MNIST dataset for both unrotated and rotated digits. The proposed technique is then compared with, and subsequently incorporated into, a VAE with fixed output distributions. This procedure is found to yield improved performance for a wide range of output distributions.
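A minimal sketch of the branched-VAE idea, assuming a fully connected PyTorch encoder/decoder and a linear classification branch on the latent code: the classification cross-entropy is simply added to the usual reconstruction and KL terms. Layer sizes, the Gaussian reconstruction loss, and the weighting factors are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchedVAE(nn.Module):
    """VAE with an extra classification branch attached to the latent code."""
    def __init__(self, in_dim=784, latent_dim=16, n_classes=10, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)   # the added branch

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), self.classifier(z), mu, logvar

def bvae_loss(x, y, recon, logits, mu, logvar, beta=1.0, gamma=1.0):
    """Total loss = reconstruction + beta * KL + gamma * classification term."""
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    ce = F.cross_entropy(logits, y, reduction="sum")
    return rec + beta * kl + gamma * ce
```

Setting gamma to zero recovers a plain VAE, which makes the contribution of the classification branch to latent-space separation easy to ablate.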