Non-negative matrix factorization (NMF) is an important technique for obtaining low-dimensional representations of datasets. However, classical NMF does not take into account data that is collected at different times or in different locations, which may exhibit heterogeneity. We resolve this problem by solving a modified NMF objective, Stratified-NMF, that simultaneously learns strata-dependent statistics and a shared topics matrix. We develop multiplicative update rules for this novel objective and prove convergence of the objective. Then, we experiment on synthetic data to demonstrate the efficiency and accuracy of the method. Lastly, we apply our method to three real-world datasets and empirically investigate their learned features.
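For orientation, the classical multiplicative updates that a stratified objective would build on can be sketched as follows. This is the standard Lee-Seung scheme for the Frobenius objective, not the Stratified-NMF update rules themselves, which the abstract does not state. The function names, the initialization, and the small constant `eps` are illustrative assumptions.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_loss(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf_multiplicative(X, rank, iters=200, seed=0, eps=1e-12):
    """Classical NMF via multiplicative updates (baseline, not Stratified-NMF)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Strictly positive random initialization (an assumption of this sketch).
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H
```

Because every update multiplies by a ratio of non-negative quantities, the factors stay non-negative without any explicit projection step.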
We introduce a class of manifold neural networks (MNNs), which we call Manifold Filter-Combine Networks (MFCNs), that aims to further our understanding of MNNs, analogous to how the aggregate-combine framework helps with the understanding of graph neural networks (GNNs). This class includes a wide variety of subclasses that can be thought of as the manifold analog of various popular GNNs. We then consider a method, based on building a data-driven graph, for implementing such networks when one does not have global knowledge of the manifold, but merely has access to finitely many sample points. We provide sufficient conditions for the network to provably converge to its continuum limit as the number of sample points tends to infinity. Unlike previous work (which focused on specific graph constructions), our rate of convergence does not directly depend on the number of filters used. Moreover, it exhibits linear dependence on the depth of the network rather than the exponential dependence obtained previously. Additionally, we provide several examples of interesting subclasses of MFCNs and of the rates of convergence that are obtained under specific graph constructions.
We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted or flipped. We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.
Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes stochastic natural thresholding (StoNT) algorithms and discusses their convergence guarantees, extending NT from the deterministic version with linear measurements to a stochastic version with a general objective function. We also conduct various numerical experiments on linear and nonlinear measurements to demonstrate the performance of StoNT.
Societal biases in the usage of words, including harmful stereotypes, are frequently learned by common word embedding methods. These biases manifest not only between a word and an explicit marker of its stereotype, but also between words that share related stereotypes. This latter phenomenon, sometimes called "indirect bias," has resisted prior attempts at debiasing. In this paper, we propose a novel method called Biased Indirect Relationship Modification (BIRM) to mitigate indirect bias in distributional word embeddings by modifying biased relationships between words before embeddings are learned. This is done by considering how the co-occurrence probability of a given pair of words changes in the presence of words marking an attribute of bias, and using this to average out the effect of a bias attribute. To evaluate this method, we perform a series of common tests and demonstrate that measures of bias in the word embeddings are reduced in exchange for a minor reduction in the semantic quality of the embeddings. In addition, we conduct novel tests for measuring indirect stereotypes by extending the Word Embedding Association Test (WEAT) with new test sets for indirect binary gender stereotypes. With these tests, we demonstrate the presence of more subtle stereotypes not addressed by previous work. The proposed method is able to reduce the presence of some of these new stereotypes, serving as a crucial next step towards non-stereotyped word embeddings.
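As a point of reference for the WEAT-based evaluation, the standard WEAT effect size can be sketched as below. This is a minimal illustration of the usual statistic (difference of mean associations divided by the standard deviation over all target words), not the new indirect-stereotype test sets described above. The function names are hypothetical, and the use of the population standard deviation is an assumption, since implementations differ on that choice.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, attr_a, attr_b):
    """Mean similarity of w to attribute set A minus mean similarity to set B."""
    sim_a = [cosine(w, a) for a in attr_a]
    sim_b = [cosine(w, b) for b in attr_b]
    return statistics.mean(sim_a) - statistics.mean(sim_b)

def weat_effect_size(targets_x, targets_y, attr_a, attr_b):
    """Difference of mean associations of the two target sets, normalized by
    the spread of associations over all target words."""
    sx = [association(w, attr_a, attr_b) for w in targets_x]
    sy = [association(w, attr_a, attr_b) for w in targets_y]
    spread = statistics.pstdev(sx + sy)  # assumption: population std. dev.
    if spread == 0.0:
        raise ValueError("all target words have identical association scores")
    return (statistics.mean(sx) - statistics.mean(sy)) / spread
```

A positive value indicates that the first target set is more strongly associated with attribute set A than the second target set is; swapping the two target sets flips the sign.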
We study the tensor robust principal component analysis (TRPCA) problem, a tensorial extension of matrix robust principal component analysis (RPCA), that aims to split the given tensor into an underlying low-rank component and a sparse outlier component. This work proposes a fast algorithm, called Robust Tensor CUR Decompositions (RTCUR), for large-scale non-convex TRPCA problems under the Tucker rank setting. RTCUR is developed within a framework of alternating projections that projects between the set of low-rank tensors and the set of sparse tensors. We utilize the recently developed tensor CUR decomposition to substantially reduce the computational complexity in each projection. In addition, we develop four variants of RTCUR for different application settings. We demonstrate the effectiveness and computational advantages of RTCUR against state-of-the-art methods on both synthetic and real-world datasets.
The Kaczmarz method (KZ) and its variants, which are types of stochastic gradient descent (SGD) methods, have been extensively studied due to their simplicity and efficiency in solving linear equation systems. The iterative hard thresholding (IHT) method has gained popularity in various research fields, including compressed sensing or sparse linear regression, machine learning with additional structure, and optimization with nonconvex constraints. Recently, a hybrid method called Kaczmarz-based IHT (KZIHT) has been proposed, combining the benefits of both approaches, but its theoretical guarantees are missing. In this paper, we provide the first theoretical convergence guarantees for KZIHT by showing that it converges linearly to the solution of a system with sparsity constraints up to optimal statistical bias when the reshuffling data sampling scheme is used. We also propose the Kaczmarz with periodic thresholding (KZPT) method, which generalizes KZIHT by applying the thresholding operation once every fixed number of KZ iterations and by employing two different types of step sizes. We establish a linear convergence guarantee for KZPT for randomly subsampled bounded orthonormal systems (BOS) and mean-zero isotropic sub-Gaussian random matrices, which are the most commonly used models in compressed sensing, dimension reduction, matrix sketching, and many inverse problems in neural networks. Our analysis shows that KZPT with an optimal thresholding period outperforms KZIHT. To support our theory, we include several numerical experiments.
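The idea of interleaving Kaczmarz projections with hard thresholding can be sketched as follows. This is a simplified illustration under stated assumptions, not the KZIHT or KZPT algorithm as analyzed: it uses unit step sizes throughout (the abstract mentions two types of step sizes without specifying them), rows are visited in a freshly reshuffled order each epoch, and the names `kaczmarz_periodic_thresholding` and `period` are hypothetical.

```python
import random

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def kaczmarz_step(x, row, target):
    """Project x onto the hyperplane {z : <row, z> = target}."""
    norm_sq = sum(a * a for a in row)
    if norm_sq == 0.0:
        return x
    residual = target - sum(a * xi for a, xi in zip(row, x))
    return [xi + residual / norm_sq * a for a, xi in zip(row, x)]

def kaczmarz_periodic_thresholding(A, b, s, period, epochs, seed=0):
    """Kaczmarz sweeps with hard thresholding after every `period` steps."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    order = list(range(len(A)))
    steps = 0
    for _ in range(epochs):
        rng.shuffle(order)  # reshuffling: every row is used once per epoch
        for i in order:
            x = kaczmarz_step(x, A[i], b[i])
            steps += 1
            if steps % period == 0:
                x = hard_threshold(x, s)
    return hard_threshold(x, s)  # return an s-sparse estimate
```

Setting `period` to the number of rows thresholds once per full pass over the data; smaller values threshold more often, which is the knob that the periodic-thresholding generalization exposes.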
One-bit quantization with time-varying sampling thresholds has recently shown significant potential in statistical signal processing applications due to its relatively low power consumption and low implementation cost. In addition to such advantages, an attractive feature of one-bit analog-to-digital converters (ADCs) is their superior sampling rates as compared to their conventional multi-bit counterparts. This characteristic endows one-bit signal processing frameworks with what we refer to as sample abundance. On the other hand, many signal recovery and optimization problems are formulated as (possibly non-convex) quadratic programs with linear feasibility constraints in the one-bit sampling regime. We demonstrate, with a particular focus on quadratic compressed sensing, that the sample abundance paradigm allows such quadratic problems to be transformed into a mere linear feasibility problem by forming a large-scale overdetermined linear system, thus removing the need for costly optimization constraints and objectives. To efficiently tackle the emerging overdetermined linear feasibility problem, we further propose an enhanced randomized Kaczmarz algorithm, called Block SKM. Several numerical results are presented to illustrate the effectiveness of the proposed methodologies.
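To make the feasibility viewpoint concrete, the sketch below treats the simpler case of linear one-bit measurements: each bit r_i = sign(<a_i, x> - tau_i) yields the linear inequality r_i (<a_i, x> - tau_i) >= 0, and a plain randomized Kaczmarz iteration projects onto a sampled inequality whenever it is violated. This is not the Block SKM algorithm and it does not cover the quadratic sensing model; the function names and the uniform row sampling are assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_bit_samples(A, x, thresholds):
    """Return +1 where <a_i, x> >= tau_i and -1 otherwise."""
    return [1 if dot(row, x) >= tau else -1 for row, tau in zip(A, thresholds)]

def kaczmarz_feasibility(A, bits, thresholds, iters=1000, seed=0):
    """Seek x with bits[i] * (<a_i, x> - tau_i) >= 0 for all i."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.randrange(len(A))  # assumption: uniform row sampling
        row = A[i]
        margin = bits[i] * (dot(row, x) - thresholds[i])
        norm_sq = dot(row, row)
        if margin < 0.0 and norm_sq > 0.0:
            # Project onto the boundary of the violated half-space.
            step = -margin * bits[i] / norm_sq
            x = [xi + step * a for xi, a in zip(x, row)]
    return x
```

Satisfied inequalities leave the iterate unchanged, so with many samples most draws cost only an inner product; any point returned that satisfies all inequalities reproduces the observed bits when re-quantized with the same thresholds.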
We introduce a new method based on nonnegative matrix factorization, Neural NMF, for detecting latent hierarchical structure in data. Datasets with hierarchical structure arise in a wide variety of fields, such as document classification, image processing, and bioinformatics. Neural NMF recursively applies NMF in layers to discover overarching topics encompassing the lower-level features. We derive a backpropagation optimization scheme that allows us to frame hierarchical NMF as a neural network. We test Neural NMF on a synthetic hierarchical dataset, the 20 Newsgroups dataset, and the MyLymeData symptoms dataset. Numerical results demonstrate that Neural NMF outperforms other hierarchical NMF methods on these datasets and offers better learned hierarchical structure and interpretability of topics.