Abstract: Higher education dropout is a critical challenge for tertiary education systems worldwide. Although machine learning techniques can achieve high predictive accuracy on selected datasets, their adoption by policymakers remains limited, particularly when the objective is the unsupervised identification and characterization of student subgroups at elevated risk of dropout. The model introduced in this paper is a specialized form of logistic regression adapted to university dropout analysis. Logistic regression remains a foundational tool among reliable statistical models, primarily because its parameters are easily interpreted in terms of odds ratios. Our approach extends this framework by accounting for heterogeneity within the student population: a preliminary clustering algorithm identifies latent subgroups, each characterized by a distinct dropout propensity, and these subgroups are then modeled via cluster-specific effects. We provide a detailed interpretation of the model parameters within this extended framework and enhance interpretability by imposing sparsity through a tailored variant of the LASSO algorithm. To demonstrate the practical applicability of the proposed methodology, we present an extensive case study based on the Italian university system, in which all the developed tools are systematically applied.
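As a rough illustration of how cluster-specific effects can be read as odds ratios, the sketch below assumes a hypothetical parameterization in which the log-odds coefficient of a covariate in cluster k is a shared term plus a cluster deviation. It also applies a one-shot soft-thresholding step, the basic building block of LASSO-type updates, to show how small deviations can be shrunk to exactly zero. The function names, the parameterization, and all numbers are assumptions made for illustration; this is not the paper's tailored LASSO variant or its fitted model.

```python
import math


def soft_threshold(value, penalty):
    """Soft-thresholding operator: shrink value toward zero by penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def cluster_odds_ratios(shared_coef, cluster_offsets, penalty=0.0):
    """Odds ratio of one covariate in each cluster.

    Assumed (hypothetical) form: the log-odds coefficient in cluster k is
    shared_coef + offset_k, so the odds ratio is exp(shared_coef + offset_k).
    Offsets are soft-thresholded first to mimic an L1 penalty on deviations.
    """
    return {
        cluster: math.exp(shared_coef + soft_threshold(offset, penalty))
        for cluster, offset in cluster_offsets.items()
    }


# Hypothetical coefficients, not taken from the case study.
offsets = {"A": 0.05, "B": -0.60, "C": 0.90}
ratios = cluster_odds_ratios(shared_coef=0.40, cluster_offsets=offsets, penalty=0.10)

for cluster, ratio in sorted(ratios.items()):
    print(f"cluster {cluster}: odds ratio = {ratio:.3f}")
```

In this toy setting the deviation of cluster A falls below the penalty and is set to zero, so cluster A keeps the shared odds ratio, while clusters B and C retain their own, shrunken, effects.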
Abstract: This paper presents a variant of the Multinomial mixture model tailored for the unsupervised classification of short text data. Traditionally, the Multinomial probability vector in this hierarchical model is assigned a Dirichlet prior distribution. Here, we instead explore the Beta-Liouville distribution as a prior, which offers a more flexible correlation structure than the Dirichlet. We examine the theoretical properties of the Beta-Liouville distribution, focusing on its conjugacy with the Multinomial likelihood. This property enables the derivation of update equations for a Coordinate Ascent Variational Inference (CAVI) algorithm, which yields approximate posterior estimates of the model parameters. Additionally, we propose a stochastic variant of the CAVI algorithm that improves scalability. The paper concludes with data examples that demonstrate effective strategies for setting the Beta-Liouville hyperparameters.
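The conjugacy mentioned above can be sketched as follows, assuming the parameterization of the Beta-Liouville distribution commonly used in the literature, with shape parameters $\alpha_1,\dots,\alpha_D$, $\alpha$, $\beta$ on a $(D+1)$-category probability vector whose last component is $1-\sum_{d=1}^{D}\theta_d$. The symbols and the count vector $\mathbf{n}$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{align*}
p(\boldsymbol{\theta}\mid \alpha_1,\dots,\alpha_D,\alpha,\beta)
  &\propto \prod_{d=1}^{D}\theta_d^{\alpha_d-1}
     \Big(\sum_{d=1}^{D}\theta_d\Big)^{\alpha-\sum_{d=1}^{D}\alpha_d}
     \Big(1-\sum_{d=1}^{D}\theta_d\Big)^{\beta-1},\\
p(\mathbf{n}\mid\boldsymbol{\theta})
  &\propto \prod_{d=1}^{D}\theta_d^{n_d}
     \Big(1-\sum_{d=1}^{D}\theta_d\Big)^{n_{D+1}},\\
p(\boldsymbol{\theta}\mid\mathbf{n})
  &\propto \prod_{d=1}^{D}\theta_d^{\alpha_d+n_d-1}
     \Big(\sum_{d=1}^{D}\theta_d\Big)^{\alpha-\sum_{d=1}^{D}\alpha_d}
     \Big(1-\sum_{d=1}^{D}\theta_d\Big)^{\beta+n_{D+1}-1}.
\end{align*}
```

Matching exponents with the prior form shows that the posterior is again Beta-Liouville, with $\alpha_d' = \alpha_d + n_d$, $\alpha' = \alpha + \sum_{d=1}^{D} n_d$, and $\beta' = \beta + n_{D+1}$. Closed-form updates of this kind are what make coordinate-ascent steps tractable in conjugate models.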