Matrix factorization is an inference problem that has gained importance thanks to its vast range of applications, from dictionary learning to recommendation systems and machine learning with deep networks. The study of its fundamental statistical limits remains a true challenge: despite a decade of efforts in the community, there is still no closed formula describing its optimal performance in the case where the rank of the matrix scales linearly with its size. In the present paper, we study this extensive-rank problem, extending the alternative 'decimation' procedure that we recently introduced, and carry out a thorough analysis of its performance. Decimation aims at recovering one column/row of the factors at a time, by mapping the problem onto a sequence of neural network models of associative memory at a tunable temperature. Although sub-optimal, decimation has the advantage of being theoretically analyzable. We extend its scope and analysis to two families of matrices. For a large class of compactly supported priors, we show that the replica-symmetric free entropy of the neural network models takes a universal form in the low-temperature limit. For a sparse Ising prior, we show that the storage capacity of the neural network models diverges as the sparsity of the patterns increases, and we introduce a simple algorithm, based on a ground-state search, that implements decimation and performs matrix factorization with no need for an informative initialization.
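Schematically, in the symmetric setting with $N$-dimensional patterns $\xi^{\mu}$ (the normalizations below are illustrative and not fixed by the text above), the observed matrix and the associated associative-memory energy read
\[
Y=\frac{1}{\sqrt{N}}\sum_{\mu=1}^{P}\xi^{\mu}(\xi^{\mu})^{\top},
\qquad
H_{Y}(\sigma)=-\frac{1}{2\sqrt{N}}\,\sigma^{\top}Y\,\sigma ,
\]
and decimation estimates one pattern as a low-energy configuration $\hat{\sigma}$ of $H_{Y}$ at a tunable inverse temperature $\beta$, subtracts its contribution, $Y\leftarrow Y-\hat{\sigma}\hat{\sigma}^{\top}/\sqrt{N}$, and repeats on the deflated matrix.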
We carry out an information-theoretic analysis of a two-layer neural network trained from input-output pairs generated by a teacher network with matching architecture, in overparametrized regimes. Our results come in the form of bounds relating (i) the mutual information between the training data and the network weights, or (ii) the Bayes-optimal generalization error, to the same quantities for a simpler (generalized) linear model, for which explicit expressions are rigorously known. Our bounds, which are expressed in terms of the number of training samples, the input dimension and the number of hidden units, thus yield fundamental performance limits for any neural network (and in fact any learning procedure) trained from limited data generated according to our two-layer teacher neural network model. The proof relies on rigorous tools from spin glasses and is guided by the ``Gaussian equivalence principles'' lying at the core of numerous recent analyses of neural networks. In contrast to the existing literature, which is either non-rigorous or restricted to learning the readout weights only, our results are information-theoretic (i.e., not specific to any learning algorithm) and, importantly, cover a setting where all the network parameters are trained.
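For concreteness, a minimal sketch of the data-generating teacher model; the tanh activation, Gaussian weights and inputs, and the scalings are our own illustrative assumptions, not fixed by the abstract:

```python
import numpy as np

# Minimal sketch of a two-layer teacher generating input-output pairs.
# All choices below (activation, Gaussian ensembles, scalings) are assumptions.
rng = np.random.default_rng(0)
n, d, k = 1000, 50, 20             # training samples, input dimension, hidden units

W = rng.normal(size=(k, d))        # first-layer teacher weights
a = rng.normal(size=k)             # readout (second-layer) teacher weights

X = rng.normal(size=(n, d))                          # i.i.d. Gaussian inputs
Y = np.tanh(X @ W.T / np.sqrt(d)) @ a / np.sqrt(k)   # teacher outputs (labels)
```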
Matrix factorization is an important mathematical problem encountered in the contexts of dictionary learning, recommendation systems and machine learning. We introduce a new `decimation' scheme that maps it to neural network models of associative memory, and provide a detailed theoretical analysis of its performance, showing that decimation is able to factorize extensive-rank matrices and to denoise them efficiently. We also introduce a decimation algorithm based on a ground-state search of the neural network, whose performance matches the theoretical predictions.
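A minimal sketch of such a ground-state-search decimation, assuming $\pm 1$ (Ising) patterns and the symmetric observation model sketched above; the greedy zero-temperature dynamics and all function names are our illustrative choices:

```python
import numpy as np

def ground_state(Y, n_sweeps=200, rng=None):
    """Greedy zero-temperature search for a low-energy state of -0.5 * s^T Y s."""
    rng = rng or np.random.default_rng()
    N = Y.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=N)      # uninformative random start
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            # align spin i with its local field (diagonal term excluded)
            h_i = Y[i] @ sigma - Y[i, i] * sigma[i]
            sigma[i] = 1.0 if h_i >= 0 else -1.0
    return sigma

def decimate(Y, P):
    """Recover P patterns one at a time, deflating Y after each recovery."""
    N = Y.shape[0]
    patterns = []
    for _ in range(P):
        xi_hat = ground_state(Y)
        patterns.append(xi_hat)
        Y = Y - np.outer(xi_hat, xi_hat) / np.sqrt(N)   # remove its contribution
    return np.array(patterns)
```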
We study the paradigmatic spiked matrix model of principal component analysis, where the rank-one signal is corrupted by additive noise. While the noise is typically taken from a Wigner matrix with independent entries, here the potential acting on the eigenvalues has a quadratic plus a quartic component. The quartic term induces strong correlations between the matrix elements, which makes the setting relevant for applications but analytically challenging. Our work provides the first characterization of the Bayes-optimal limits for inference in this model with structured noise. If the signal prior is rotationally invariant, then we show that a spectral estimator is optimal. In contrast, for more general priors the existing approximate message passing (AMP) algorithm falls short of the information-theoretic limits, and we provide a justification for this sub-optimality. Finally, by generalizing the theory of Thouless-Anderson-Palmer equations, we cure the issue by proposing a novel AMP that matches the theoretical limits. Our information-theoretic analysis is based on the replica method, a powerful heuristic from statistical mechanics; the novel AMP, instead, comes with a rigorous state evolution analysis tracking its performance in the high-dimensional limit. Although we focus on a specific noise distribution, our methodology can be generalized to a wide class of trace ensembles, at the cost of more involved expressions.
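In formulas (with illustrative normalizations; the abstract fixes neither the spike scaling nor the coefficients of the potential), the observation model can be written as
\[
Y=\frac{\theta}{N}\,x x^{\top}+Z,
\qquad
P(Z)\propto \exp\!\Big(-\tfrac{N}{2}\,\mathrm{Tr}\,V(Z)\Big),
\qquad
V(z)=\frac{\mu}{2}z^{2}+\frac{\gamma}{4}z^{4},
\]
where $\theta$ plays the role of a signal-to-noise ratio: $\gamma=0$ recovers the classical Wigner case with independent Gaussian entries, while $\gamma>0$ induces the correlations between matrix elements mentioned above. In this language, a natural spectral estimator is built from the leading eigenvector of $Y$.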