Abstract: Unsupervised machine learning is widely used to mine large, unlabeled datasets to make data-driven discoveries in critical domains such as climate science, biomedicine, astronomy, chemistry, and more. However, despite its widespread use, there is a lack of standardization in unsupervised learning workflows for making reliable and reproducible scientific discoveries. In this paper, we present a structured workflow for using unsupervised learning techniques in science. We highlight and discuss best practices: formulating validatable scientific questions, conducting robust data preparation and exploration, using a range of modeling techniques, performing rigorous validation by evaluating the stability and generalizability of unsupervised learning conclusions, and promoting effective communication and documentation of results to ensure reproducible scientific discoveries. To illustrate our proposed workflow, we present a case study from astronomy, seeking to refine globular clusters of Milky Way stars based on their chemical composition. Our case study highlights the importance of validation and illustrates how the benefits of a carefully designed workflow for unsupervised learning can advance scientific discovery.
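The validation step described above centers on checking whether unsupervised conclusions hold up under small perturbations of the data. The sketch below is a minimal illustration of that idea (it is not taken from the paper): k-means is refit on bootstrap resamples and the resulting cluster assignments are compared to a reference clustering with the Adjusted Rand Index. The synthetic dataset, the choice of k, and the number of resamples are all illustrative assumptions.

```python
# Hypothetical stability check for a clustering conclusion:
# refit on bootstrap resamples and compare labels on the original points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    km = KMeans(n_clusters=4, n_init=10).fit(X[idx])      # refit on perturbed data
    perturbed = km.predict(X)                              # labels for the original points
    scores.append(adjusted_rand_score(reference, perturbed))

# Mean ARI near 1 suggests the clustering is stable under data perturbation.
print(f"mean ARI across resamples: {np.mean(scores):.2f}")
```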
Abstract: As machine learning systems are increasingly used in high-stakes domains, there is growing emphasis on making them interpretable to improve trust in these systems. In response, a range of interpretable machine learning (IML) methods have been developed to generate human-understandable insights into otherwise black-box models. With these methods, a fundamental question arises: Are these interpretations reliable? Unlike prediction accuracy or other evaluation metrics for supervised models, proximity to the true interpretation is difficult to define. Instead, we ask a closely related question that we argue is a prerequisite for reliability: Are these interpretations stable? We define stability as the consistency of findings under small random perturbations to the data or algorithms. In this study, we conduct the first systematic, large-scale empirical study of the stability of popular global machine learning interpretations for both supervised and unsupervised tasks on tabular data. Our findings reveal that popular interpretation methods are frequently unstable, notably less stable than the predictions themselves, and that there is no association between the accuracy of machine learning predictions and the stability of their associated interpretations. Moreover, we show that no single method consistently provides the most stable interpretations across a range of benchmark datasets. Overall, these results suggest that interpretability alone does not warrant trust, and they underscore the need for rigorous evaluation of interpretation stability in future work. To support these principles, we have developed and released an open-source IML dashboard and Python package to enable researchers to assess the stability and reliability of their own data-driven interpretations and discoveries.
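The notion of interpretation stability defined above can be made concrete with a small sketch. The example below (not the paper's released package) compares one global interpretation, random forest feature importances, across bootstrap perturbations of the training data, using the mean pairwise Spearman rank correlation of the importance vectors as a rough stability score. The dataset, model, perturbation scheme, and number of repeats are illustrative assumptions.

```python
# Hypothetical stability check for a global interpretation:
# recompute feature importances on perturbed data and measure rank agreement.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
rng = np.random.default_rng(0)

importances = []
for seed in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)            # perturb the data
    model = RandomForestClassifier(random_state=seed).fit(X[idx], y[idx])
    importances.append(model.feature_importances_)                  # one global interpretation

# Pairwise rank agreement of the interpretations; low values signal instability.
pairs = [spearmanr(importances[i], importances[j])[0]
         for i in range(len(importances))
         for j in range(i + 1, len(importances))]
print(f"mean Spearman correlation of importances: {np.mean(pairs):.2f}")
```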