Abstract:We study estimation in the low signal-to-noise ratio (SNR) regime for a broad class of Gaussian latent-variable models, including Gaussian mixtures and orbit recovery problems. We show that, in this regime, the generalized method-of-moments (GMoM) matches the first-order asymptotic efficiency of maximum likelihood. In particular, if the moment features are chosen up to the minimal local order required for identification and are weighted optimally, then the resulting GMoM estimator has the same leading asymptotic covariance as the maximum-likelihood estimator. Our analysis shows that, in low SNR, this equivalence is governed by a layered local geometry: different directions become informative at different moment orders, partitioning the space into layers with distinct SNR scalings. We prove that the observed Fisher information and the GMoM information operator admit matching layerwise expansions across these layers. As a consequence, in the low-SNR regime, GMoM provides a statistically efficient alternative to maximum likelihood, while preserving the computational advantages of moment-based estimation.
Abstract:Motivated by structural biology applications, we study the projected multi-reference alignment (MRA) model, in which an unknown signal is observed through noisy samples, each generated by applying a random cyclic shift followed by a fixed projection. The projection merges reflection-symmetric index pairs, thereby discarding orientation information. The goal is to recover the dihedral orbit of the signal. We prove that in the high-noise regime, the first three moments of the projected observations determine a generic dihedral orbit. The main mechanism is a reduction, at the moment level, from projected MRA to the reflection-invariant phase-coupling structure of dihedral MRA. In Fourier-cosine coordinates adapted to the projection, the first moment determines the mean component, the second moment determines the Fourier magnitudes, and selected third moments yield the cosine phase-coupling relations appearing in the dihedral bispectrum. These relations lead to a constructive recovery scheme from moments up to order three. We complement the population theory with finite-sample experiments comparing expectation--maximization (EM), direct moment optimization, and direct Fourier-cosine moment optimization. The results show that, in the high-noise regime, both EM and direct moment optimization are consistent with the predicted third-moment sample-complexity scaling $n \gtrsim σ^6$, where $n$ is the number of observations and $σ^2$ is the noise variance.
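The moment-to-invariant correspondence described above (mean, Fourier magnitudes, cosine phase couplings) can be illustrated on the simpler, unprojected dihedral setting. The following is a minimal sketch, not the paper's projected model or its recovery scheme: it assumes a real signal acted on by cyclic shifts and reflections, and the helper name `invariants` is hypothetical. It checks that the power spectrum is invariant under both actions, that the bispectrum is shift-invariant, and that a reflection conjugates the bispectrum, so only its real (cosine) part survives under the dihedral group.

```python
import numpy as np

def invariants(x):
    """Return (mean, power spectrum, bispectrum) of a real 1-D signal."""
    L = len(x)
    X = np.fft.fft(x)
    k = np.arange(L)
    mean = X[0].real / L
    power = np.abs(X) ** 2
    # B[k1, k2] = X[k1] * X[k2] * conj(X[(k1 + k2) mod L])
    bispec = X[:, None] * X[None, :] * np.conj(X[(k[:, None] + k[None, :]) % L])
    return mean, power, bispec

rng = np.random.default_rng(0)
x = rng.standard_normal(9)          # illustrative signal (assumption)
shifted = np.roll(x, 3)             # cyclic shift
reflected = np.roll(x[::-1], 2)     # reflection composed with a shift

m0, p0, b0 = invariants(x)
m1, p1, b1 = invariants(shifted)
m2, p2, b2 = invariants(reflected)

print("power spectrum shift-invariant:", np.allclose(p0, p1))
print("bispectrum shift-invariant:", np.allclose(b0, b1))
print("Re(bispectrum) reflection-invariant:", np.allclose(b0.real, b2.real))
```

The real part of the bispectrum carries the cosines of the phase combinations, which is the sense in which reflection-invariant third-order information is a "cosine phase-coupling" relation.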
Abstract:We study estimation and clustering in Gaussian mixture models under variance misspecification. Observations are generated with true variance $σ^2$, while the component means are estimated using a likelihood with variance $τ^2$, yielding a family of mismatched likelihood functions parameterized by the ratio $ρ=τ/σ$. We show that the interplay between $ρ$ and the signal-to-noise ratio (SNR) induces a sharp phase diagram. Under correct specification ($ρ=1$), maximum likelihood recovers the true means, independently of the SNR. However, once the model is misspecified, two different regimes emerge. With under-smoothing ($ρ<1$), the estimated Gaussian means are displaced from the truth, and in low SNR this discrepancy grows as the SNR decreases: for every fixed $ρ<1$, the squared error scales as $\mathrm{SNR}^{-1}$. With over-smoothing ($ρ>1$), the fitted likelihood blurs the cluster separation, causing distinct component means to collapse towards the overall mixture center once $ρ^2$ exceeds a threshold of the form $1 + λ\,\mathrm{SNR}$, where $λ$ depends on the geometry of the true means. We further show that the hard-assignment objective arises as the limit $τ\to 0$ of the same mismatched likelihood family, and derive corresponding low- and high-SNR results for hard-assignment mean estimation and latent-label recovery. Furthermore, in low SNR, Bayes-optimal clustering is close to random guessing, and the hard-assignment target remains far from the true means. These results show that in low-SNR applications, even mild variance misspecification or hard-assignment procedures can induce substantial bias, whereas in high SNR these effects are largely absent.
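The two extremes of the mismatched family can be seen in a few lines. This is a rough sketch under stated assumptions, not the paper's analysis: it assumes an equal-weight, two-component mixture with isotropic noise, and the names `responsibilities` and `mean_update` are hypothetical. It computes the posterior weights under a fitted standard deviation `tau`, shows that they become nearest-mean (hard) assignments as `tau` shrinks, and shows that a single mean update with a very large `tau` pulls every component mean to the overall data mean.

```python
import numpy as np

def responsibilities(x, means, tau):
    """Posterior component weights under an equal-weight mixture with fitted std tau."""
    sq = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    logits = -sq / (2.0 * tau ** 2)
    logits -= logits.max(axis=1, keepdims=True)                   # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def mean_update(x, means, tau):
    """One EM-style update of the component means under fitted std tau."""
    r = responsibilities(x, means, tau)
    return (r.T @ x) / r.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
true_means = np.array([[1.0, 0.0], [-1.0, 0.0]])   # illustrative values (assumption)
labels = rng.integers(0, 2, size=200)
x = true_means[labels] + rng.standard_normal((200, 2))   # true sigma = 1

nearest = ((x[:, None, :] - true_means[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
r_small = responsibilities(x, true_means, tau=1e-4)   # tau -> 0: hard assignment
collapsed = mean_update(x, true_means, tau=1e3)       # heavy over-smoothing

print("hard-assignment limit matches nearest mean:",
      np.array_equal(r_small.argmax(axis=1), nearest))
print("over-smoothed means:", collapsed, "data mean:", x.mean(axis=0))
```

The single update with large `tau` only illustrates the collapse mechanism; the threshold $1 + λ\,\mathrm{SNR}$ quoted in the abstract concerns the fitted likelihood's optimum, which this sketch does not compute.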
Abstract:Let $f:\mathbb{R}^n\to\mathbb{R}$ be an unknown object, and suppose the observations are tomographic projections of randomly rotated copies of $f$ of the form $Y = P(R\cdot f)$, where $R$ is Haar-uniform in $\mathrm{SO}(n)$ and $P$ is the projection onto an $m$-dimensional subspace, so that $Y:\mathbb{R}^m\to\mathbb{R}$. We prove that, whenever $d\le m$, the $d$-th order moment of the projected data determines the full $d$-th order Haar-orbit moment of $f$, independently of the ambient dimension $n$. We further provide an explicit algorithmic procedure for recovering the latter from the former. As a consequence, any identifiability result for the unprojected model based on $d$-th order group-invariant moments extends directly to the tomographic setting at the same moment order. In particular, for $n=3$, $m=2$, and $d=2$, our result recovers a classical result in the cryo-EM literature: the covariance of the 2D projection images determines the second-order rotationally invariant moment of the underlying 3D object.
Abstract:We study the recovery of an unknown three-dimensional band-limited signal from multiple noisy observations that are randomly rotated by latent elements of SO(3), where the rotations are drawn from an unknown, non-uniform distribution. Because the rotations are unobserved, only the signal orbit under the rotation group can be recovered. We show that the signal orbit and the rotation distribution are jointly identifiable from the first and second moments. This yields an improved high-noise sample complexity that scales quadratically with the noise variance, rather than cubically as in the uniform-rotation case. We further develop a provable, computationally efficient reconstruction algorithm that recovers the 3-D signal by successively solving a sequence of well-conditioned linear systems. The algorithm is validated through extensive numerical experiments. Our results provide a principled and tractable framework for high-noise 3-D orbit recovery, with potential relevance to cryo-electron microscopy and cryo-electron tomography modeling, where molecules are observed in unknown orientations.
Abstract:Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition, even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
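The fixed-point phenomenon is easy to reproduce numerically. The sketch below is an illustration under assumed parameters (20 points, dimension 5000, unit noise, two clusters separated along one coordinate), not the paper's experimental setup; `lloyd_step` is a hypothetical helper. It starts from a random balanced partition and applies one Lloyd iteration. Because each point contributes to its own centroid, its squared distance to that centroid is smaller by roughly $2d/m$ for clusters of size $m$, which dominates the fluctuations when $d$ is large.

```python
import numpy as np

def lloyd_step(x, labels, k):
    """One Lloyd iteration: recompute centroids, then reassign points to the nearest one."""
    centroids = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    sq = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq.argmin(axis=1)

rng = np.random.default_rng(0)
n, d, k = 20, 5000, 2                       # illustrative sizes (assumption)
true_labels = np.repeat([0, 1], n // 2)
means = np.zeros((k, d))
means[0, 0], means[1, 0] = 1.0, -1.0        # clusters differ only in coordinate 0
x = means[true_labels] + rng.standard_normal((n, d))

init = rng.permutation(true_labels)         # random balanced starting partition
after = lloyd_step(x, init, k)

print("random partition unchanged by Lloyd step:", np.array_equal(after, init))
```

In this regime the update leaves the arbitrary starting partition untouched, which is the behavior the abstract describes; Hartigan's rule, which removes a point from its own centroid before comparing, is not sketched here.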
Abstract:Motivated by single-particle cryo-electron microscopy, multi-reference alignment (MRA) models the task of recovering an unknown signal from multiple noisy observations corrupted by random rotations. The standard approach, expectation-maximization (EM), often becomes computationally prohibitive, particularly in low signal-to-noise ratio (SNR) settings. We introduce an alternative, ultra-fast algorithm for MRA over the special orthogonal group $\mathrm{SO}(2)$. By performing a Taylor expansion of the log-likelihood in the low-SNR regime, we estimate the signal by sequentially computing data-driven averages of observations. Our method requires only one pass over the data, dramatically reducing computational cost compared to EM. Numerical experiments show that the proposed approach achieves high accuracy in low-SNR environments and provides an excellent initialization for subsequent EM refinement.




Abstract:We study the orbit recovery problem under the rigid-motion group SE(n), where the objective is to reconstruct an unknown signal from multiple noisy observations subjected to unknown rotations and translations. This problem is fundamental in signal processing, computer vision, and structural biology. Our main theoretical contribution is a bound on the sample complexity of this problem. We show that if the d-th order moment under the rotation group SO(n) uniquely determines the signal orbit, then orbit recovery under SE(n) is achievable with $N\gtrsim σ^{2d+4}$ samples as the noise variance $σ^2 \to \infty$. The key technical insight is that the d-th order SO(n) moments can be explicitly recovered from (d+2)-order SE(n) autocorrelations, enabling us to transfer known results from the rotation-only setting to the rigid-motion case. We further harness this result to derive a matching bound on the sample complexity of the multi-target detection model, which serves as an abstract framework for electron-microscopy-based technologies in structural biology, such as single-particle cryo-electron microscopy (cryo-EM) and cryo-electron tomography (cryo-ET). Beyond theory, we present a provable computational pipeline for rigid-motion orbit recovery in three dimensions. Starting from rigid-motion autocorrelations, we extract the SO(3) moments and demonstrate successful reconstruction of a 3-D macromolecular structure. Importantly, this algorithmic approach is valid at any noise level, suggesting that even very small macromolecules, long believed to be inaccessible to electron-microscopy-based structural biology technologies, may, in principle, be reconstructed given sufficient data.
Abstract:Principal component analysis (PCA) is a fundamental technique for dimensionality reduction and denoising; however, its application to three-dimensional data with arbitrary orientations -- common in structural biology -- presents significant challenges. A naive approach requires augmenting the dataset with many rotated copies of each sample, incurring prohibitive computational costs. In this paper, we extend PCA to 3D volumetric datasets with unknown orientations by developing an efficient and principled framework for SO(3)-invariant PCA that implicitly accounts for all rotations without explicit data augmentation. By exploiting underlying algebraic structure, we demonstrate that the computation involves only the square root of the total number of covariance entries, resulting in a substantial reduction in complexity. We validate the method on real-world molecular datasets, demonstrating its effectiveness and opening up new possibilities for large-scale, high-dimensional reconstruction problems.
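The complexity claim has a simple one-dimensional analogue that may help fix ideas. The sketch below swaps SO(3) for the cyclic shift group acting on length-$L$ vectors, so it illustrates the general principle only and is not the paper's algorithm; both function names are hypothetical. The naive route augments the data with all $L$ shifts and forms an $L \times L$ second-moment matrix. That matrix is circulant, so its principal components are Fourier modes and its $L$ eigenvalues are the averaged power spectrum divided by $L$, with no augmentation required.

```python
import numpy as np

def augmented_covariance(x):
    """Naive route: second-moment matrix of the data augmented with all L cyclic shifts."""
    n, L = x.shape
    shifted = np.concatenate([np.roll(x, s, axis=1) for s in range(L)])   # (n * L, L)
    return shifted.T @ shifted / (n * L)

def invariant_spectrum(x):
    """Same eigenvalues from L numbers, without forming any shifted copies."""
    L = x.shape[1]
    return (np.abs(np.fft.fft(x, axis=1)) ** 2).mean(axis=0) / L

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 12))           # 30 samples of length 12 (assumption)

C = augmented_covariance(x)
naive_eigs = np.sort(np.linalg.eigvalsh(C))
fast_eigs = np.sort(invariant_spectrum(x))

print("eigenvalues agree:", np.allclose(naive_eigs, fast_eigs))
```

Here $L$ numbers replace $L^2$ covariance entries, which mirrors the square-root reduction stated in the abstract; the SO(3) case relies on different algebraic structure that this toy example does not capture.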




Abstract:We study the multi-reference alignment model, which involves recovering a signal from noisy observations that have been randomly transformed by an unknown group action, a fundamental challenge in statistical signal processing, computational imaging, and structural biology. While much of the theoretical literature has focused on the asymptotic sample complexity of this model, the practical performance of reconstruction algorithms, particularly of the omnipresent expectation maximization (EM) algorithm, remains poorly understood. In this work, we present a detailed investigation of EM in the challenging low signal-to-noise ratio (SNR) regime. We identify and characterize two failure modes that emerge in this setting. The first, called Einstein from Noise, reveals a strong sensitivity to initialization, with reconstructions resembling the input template regardless of the true underlying signal. The second phenomenon, referred to as the Ghost of Newton, involves EM initially converging towards the correct solution but later diverging, leading to a loss of reconstruction fidelity. We provide theoretical insights and support our findings through numerical experiments. Finally, we introduce a simple, yet effective modification to EM based on mini-batching, which mitigates the above artifacts. Supported by both theory and experiments, this mini-batching approach processes small data subsets per iteration, reducing initialization bias and computational cost, while maintaining accuracy comparable to full-batch EM.
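For readers unfamiliar with the iteration being analyzed, one EM step for the simplest MRA instance can be sketched as follows. The sketch assumes a 1-D signal, uniformly distributed cyclic shifts, and i.i.d. Gaussian noise of known standard deviation `sigma`; `em_step` is a hypothetical helper and not the authors' implementation. The E-step computes posterior weights over all candidate shifts, and the M-step averages the observations after undoing each shift, weighted by those posteriors.

```python
import numpy as np

def em_step(y, x, sigma):
    """One EM iteration for 1-D MRA with uniform cyclic shifts and noise std sigma."""
    n, L = y.shape
    candidates = np.stack([np.roll(x, s) for s in range(L)])            # row s: x shifted by s
    sq = ((y[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=2)    # (n, L)
    logits = -sq / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)                         # stabilize the softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                                   # E-step: shift posteriors
    # M-step: undo each candidate shift and average with the posterior weights.
    aligned = np.stack([np.roll(y, -s, axis=1) for s in range(L)], axis=1)   # (n, L, L)
    x_new = (w[:, :, None] * aligned).sum(axis=(0, 1)) / n
    return x_new, w

rng = np.random.default_rng(0)
L, n = 8, 50                                     # illustrative sizes (assumption)
x_true = rng.standard_normal(L)
shifts = rng.integers(0, L, size=n)
y_clean = np.stack([np.roll(x_true, s) for s in shifts])   # noise-free sanity check

x_new, w = em_step(y_clean, x_true, sigma=0.1)
print("true signal is a fixed point without noise:", np.allclose(x_new, x_true))
```

A mini-batch variant in the spirit of the abstract would apply `em_step` to a random subset of the rows of `y` at each iteration; the batch size and schedule used in the paper are not stated here, so they are left unspecified.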