Artur Pak

Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm

Jun 13, 2025

Zhenisbek Assylbekov, Alan Legg, Artur Pak

Abstract: We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models -- that is, when the number of components in the fitted model exceeds that of the true underlying distribution. Focusing on a structured configuration where the component means are positioned at the vertices of a regular simplex and the mixture weights satisfy a non-degeneracy condition, we demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance. Our analysis leverages the strong convexity of the negative log-likelihood function in a neighborhood around the optimum and utilizes the Polyak-Łojasiewicz inequality to establish that an $\epsilon$-accurate approximation is achievable in $O(\log(1/\epsilon))$ iterations. Furthermore, we extend these results to a finite-sample setting by deriving explicit statistical convergence guarantees. Numerical experiments on synthetic datasets corroborate our theoretical findings, highlighting the dramatic acceleration in convergence compared to conventional sublinear rates. This work not only deepens the understanding of EM's behavior in overspecified settings but also offers practical insights into initialization strategies and model design for high-dimensional clustering and density estimation tasks.

* ECML PKDD 2025
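
The overspecified setting in the abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional toy case that is not taken from the paper: data drawn from a single standard Gaussian, fitted with the two-component model pi * N(theta, 1) + (1 - pi) * N(-theta, 1) with fixed unequal weights (pi = 0.3). The function names, sample size, and starting point are all illustrative choices.

```python
import math
import random


def em_step(theta, xs, pi):
    """One EM update for theta in the fitted model
    pi * N(theta, 1) + (1 - pi) * N(-theta, 1), with pi held fixed."""
    log_odds = math.log(pi / (1.0 - pi))
    total = 0.0
    for x in xs:
        # E-step: posterior probability that x came from the +theta component.
        w = 1.0 / (1.0 + math.exp(-(2.0 * theta * x + log_odds)))
        # M-step: the new theta is the sample average of (2w - 1) * x.
        total += (2.0 * w - 1.0) * x
    return total / len(xs)


def run_em(theta0, xs, pi, n_iters):
    """Return the list of iterates theta_0, ..., theta_{n_iters}."""
    trajectory = [theta0]
    for _ in range(n_iters):
        trajectory.append(em_step(trajectory[-1], xs, pi))
    return trajectory


rng = random.Random(0)
# Assumed ground truth: one standard Gaussian, so a two-component fit is overspecified.
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
trajectory = run_em(theta0=1.0, xs=data, pi=0.3, n_iters=50)
for t in (0, 5, 10, 20, 50):
    print(t, abs(trajectory[t]))
```

In this toy model the population update linearized at theta = 0 has slope 4 * pi * (1 - pi), which is below 1 only when the weights are unequal. That is one simple way to see why a condition on the weights matters, though it may not coincide with the non-degeneracy condition used in the paper.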

Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic

Oct 19, 2023

Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhenisbek Assylbekov

Abstract: Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by Statistical Query algorithms. Recently this classical fact re-emerged in the theory of gradient-based optimization of neural networks. In this framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. The set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has recently attracted some attention from deep learning theorists and cryptographers. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to learn a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either the frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
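
The role of the variance over targets can be made concrete with a small numerical probe. The sketch below is an illustration under stated assumptions, not the paper's analysis: a fixed function sin(2*pi*x/p) stands in for the derivative of a model output with respect to one parameter, and the quantity measured is the variance, over a uniform choice of a in {1, ..., p-1}, of its correlation with the centered target (a*x mod p)/p. The function names and the choice of probe are hypothetical.

```python
import math


def probe_correlation(a, p):
    """Correlation over x in Z_p between the centered target (a*x mod p)/p
    and the fixed probe sin(2*pi*x/p)."""
    mean_target = (p - 1) / (2.0 * p)  # mean of y/p over y = 0, ..., p-1
    total = 0.0
    for x in range(p):
        target = ((a * x) % p) / p - mean_target
        total += target * math.sin(2.0 * math.pi * x / p)
    return total / p


def variance_over_targets(p):
    """Variance of the probe correlation when a is uniform on {1, ..., p-1}."""
    values = [probe_correlation(a, p) for a in range(1, p)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


for p in (11, 101, 1009):
    print(p, variance_over_targets(p))
```

For this particular probe the printed variance shrinks as p grows, so the target-dependent part of the signal averaged over the class becomes small for large p, which is the kind of effect the abstract describes.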

Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Oct 02, 2023

Rustem Takhanov, Maxat Tezekbayev, Artur Pak, Arman Bolatov, Zhibek Kadyrsizova, Zhenisbek Assylbekov

Abstract: The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on the Boas-Bellman inequality in inner product spaces and involves establishing approximate orthogonality of the discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases.

* ACML 2023
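
The learning target named in the abstract above, the parity bit of the discrete logarithm in a cyclic group of prime order, can be written out for a tiny instance. This is a minimal sketch with illustrative parameters that do not come from the paper: the subgroup of order q = 11 generated by g = 2 in the multiplicative group modulo p = 23. The function name `parity_labels` is hypothetical, and q is assumed to be prime rather than checked.

```python
def parity_labels(g, p, q):
    """Map each element h of the subgroup generated by g modulo p to the
    parity bit of its discrete logarithm log_g(h) in {0, ..., q-1}.
    Assumes q is prime, so g**q == 1 with g != 1 means g has order exactly q."""
    if g % p == 1 or pow(g, q, p) != 1:
        raise ValueError("g must generate a subgroup of order q modulo p")
    labels = {}
    h = 1
    for exponent in range(q):
        labels[h] = exponent % 2  # here h == g**exponent mod p
        h = (h * g) % p
    return labels


# Toy parameters: p = 23 = 2 * 11 + 1, and 2 has order 11 modulo 23.
P, Q, G = 23, 11, 2
labels = parity_labels(G, P, Q)
for h in sorted(labels):
    print(h, labels[h])
```

A learner would see pairs (h, label) and try to predict the label for unseen h. The brute-force table is only feasible because the group is tiny.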

Long-Tail Theory under Gaussian Mixtures

Jul 24, 2023

Arman Bolatov, Maxat Tezekbayev, Igor Melnykov, Artur Pak, Vassilina Nikoulina, Zhenisbek Assylbekov

Abstract: We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data.

* accepted to ECAI 2023