Abstract: Conformal prediction provides distribution-free coverage guarantees, but in many-class classification it may still under-cover specific classes or subpopulations, preventing safe deployment in high-stakes applications. We propose Cluster Frequency Conformal Prediction (CFCP), a plug-in framework that adapts conformal prediction to local structure in a learned representation space. CFCP clusters learned embeddings, estimates cluster-level label-frequency distributions from calibration data, and for each test point constructs a sample-specific probability vector by softly mixing nearby cluster distributions, regularized with global-prior and reliability-aware shrinkage. This vector is then conformalized using standard set constructors. In the disjoint-split regime, CFCP inherits standard finite-sample marginal validity. Under additional assumptions, CFCP further admits a local-validity interpretation. Since representation clusters aggregate locally similar samples, their empirical class frequencies provide a stable estimate of local label ambiguity. Across image and text benchmarks, CFCP achieves the best class coverage in 15 of 16 dataset/score-family comparisons while keeping prediction set sizes competitive, and it is substantially more efficient in several settings. Overall, our results show that cluster-frequency information provides an effective localized signal for improving classwise reliability in many-class conformal prediction.
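The pipeline described in this abstract (cluster frequencies, soft mixing, shrinkage, then conformalization) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the additive shrinkage form with strength `shrink`, the softmax-over-distances mixing with `temperature`, and the score `1 - p[y]` are all assumptions chosen for illustration, and the centroids are assumed to come from a clustering step run beforehand.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_frequencies(embs, labels, centroids, n_classes, shrink=5.0):
    """Per-cluster label frequencies, shrunk toward a smoothed global prior.

    Assumed shrinkage form: clusters with few samples lean more on the prior.
    """
    total = len(labels)
    prior = [(labels.count(k) + 1.0) / (total + n_classes) for k in range(n_classes)]
    counts = [[0.0] * n_classes for _ in centroids]
    for e, y in zip(embs, labels):
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(e, centroids[j]))
        counts[nearest][y] += 1.0
    freqs = []
    for row in counts:
        n = sum(row)
        freqs.append([(row[k] + shrink * prior[k]) / (n + shrink) for k in range(n_classes)])
    return freqs


def sample_probs(e, centroids, freqs, temperature=1.0):
    """Sample-specific probability vector: softmax-weighted mix of cluster rows."""
    logits = [-sq_dist(e, c) / temperature for c in centroids]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    z = sum(weights)
    n_classes = len(freqs[0])
    return [
        sum(weights[j] / z * freqs[j][k] for j in range(len(centroids)))
        for k in range(n_classes)
    ]


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(scores)[rank - 1]


def prediction_set(p, qhat):
    """All classes whose nonconformity score 1 - p[k] is within the threshold."""
    return [k for k, pk in enumerate(p) if 1.0 - pk <= qhat]
```

To stay in the disjoint-split regime the abstract mentions, one would estimate `freqs` on one part of the held-out data and compute the scores `1 - sample_probs(e, ...)[y]` passed to `conformal_threshold` on a separate part.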
Abstract: Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in $[0,1]$ that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
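As a rough illustration of the trajectory idea in this abstract, the sketch below builds a cumulative prediction trajectory from per-tree outputs, computes a few descriptors, and maps a raw signal to $[0,1]$ with an empirical CDF. The abstract does not define the descriptors, so the definitions here (population variance of the trajectory, sign changes between consecutive cumulative values, range over the last `tail` steps) and all names are hypothetical; the regression model that predicts held-out loss is omitted.

```python
from bisect import bisect_right
from itertools import accumulate
from statistics import pvariance


def trajectory_descriptors(tree_outputs, tail=3):
    """Descriptors of one instance's cumulative prediction trajectory.

    tree_outputs: per-tree contributions (e.g. raw margins), in boosting order.
    All descriptor definitions below are illustrative assumptions.
    """
    traj = list(accumulate(tree_outputs))  # running sum after each tree
    sign_switches = sum(1 for a, b in zip(traj, traj[1:]) if a * b < 0)
    tail_vals = traj[-tail:]
    return {
        "variance": pvariance(traj),
        "sign_switches": sign_switches,
        # small range over the last few trees = stable tail
        "tail_range": max(tail_vals) - min(tail_vals),
    }


def ecdf_score(reference, value):
    """Empirical CDF: fraction of reference values that are <= value."""
    ref = sorted(reference)
    return bisect_right(ref, value) / len(ref)
```

In such a setup, `reference` would hold the raw difficulty signals of a held-out set, so that `ecdf_score` returns the rank of a new instance relative to that set.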
Abstract: Machine learning models are essential for decision-making and risk assessment, tasks that require highly reliable predictions in terms of both discrimination and calibration. While calibration often receives less attention, it is crucial for critical decisions, such as those in clinical predictions. We introduce a novel calibration pipeline that leverages an ensemble of calibration functions trained on clusters of learned representations of the input samples to enhance overall calibration. This approach not only raises the calibration score of various methods from 82.28% to as high as 100% but also introduces a unique matching metric that ensures model selection optimizes both discrimination and calibration. Our generic scheme adapts to any underlying representation, clustering, calibration method, and metric, offering flexibility and superior performance across commonly used calibration methods.