Abstract:The generation of data is a common approach to improve the performance of machine learning tasks, among which is the training of models for classification. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflow and open new directions for LLM-based data generation methods.
Abstract:Biclustering techniques have been widely used to identify homogeneous subgroups within large data matrices, such as subsets of genes similarly expressed across subsets of patients. Mining a max-sum sub-matrix is a related but distinct problem for which one looks for a (non-necessarily contiguous) rectangular sub-matrix with a maximal sum of its entries. Le Van et al. (Ranked Tiling, 2014) already illustrated its applicability to gene expression analysis and addressed it with a constraint programming (CP) approach combined with large neighborhood search (CP-LNS). In this work, we exhibit some key properties of this NP-hard problem and define a bounding function such that larger problems can be solved in reasonable time. Two different algorithms are proposed in order to exploit the highlighted characteristics of the problem: a CP approach with a global constraint (CPGC) and mixed integer linear programming (MILP). Practical experiments conducted both on synthetic and real gene expression data exhibit the characteristics of these approaches and their relative benefits over the original CP-LNS method. Overall, the CPGC approach tends to be the fastest to produce a good solution. Yet, the MILP formulation is arguably the easiest to formulate and can also be competitive.
Abstract:This paper presents an original approach for jointly fitting survival times and classifying samples into subgroups. The Coxlogit model is a generalized linear model with a common set of selected features for both tasks. Survival times and class labels are here assumed to be conditioned by a common risk score which depends on those features. Learning is then naturally expressed as maximizing the joint probability of subgroup labels and the ordering of survival events, conditioned to a common weight vector. The model is estimated by minimizing a regularized log-likelihood through a coordinate descent algorithm. Validation on synthetic and breast cancer data shows that the proposed approach outperforms a standard Cox model or logistic regression when both predicting the survival times and classifying new samples into subgroups. It is also better at selecting informative features for both tasks.