Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2008246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps-notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic Synchronization, and Data Representation--while providing certainty--based triggers for human intervention.
Ordinal data is widely prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning based methods and publicly available software to address it. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models on multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.
The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
It has been suggested that generative image models such as diffusion models can improve performance on clinically relevant tasks by offering deep learning models supplementary training data. However, most conditional diffusion models treat disease stages as independent classes, ignoring the continuous nature of disease progression. This mismatch is problematic in medical imaging because continuous pathological processes are typically only observed through coarse, discrete but ordered labels as in ophthalmology for diabetic retinopathy (DR). We propose an ordinal latent diffusion model for generating color fundus images that explicitly incorporates the ordered structure of DR severity into the generation process. Instead of categorical conditioning, we used a scalar disease representation, enabling a smooth transition between adjacent stages. We evaluated our approach using visual realism metrics and classification-based clinical consistency analysis on the EyePACS dataset. Compared to a standard conditional diffusion model, our model reduced the Fréchet inception distance for four of the five DR stages and increased the quadratic weighted $κ$ from 0.79 to 0.87. Furthermore, interpolation experiments showed that the model captured a continuous spectrum of disease progression learned from ordered, coarse class labels.
Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15). On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.
Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to insufficient insulin production or impaired insulin utilization. One of its most severe complications is diabetic retinopathy (DR), a progressive retinal disease caused by microvascular damage, leading to hemorrhages, exudates, and potential vision loss. Early and reliable detection of DR is therefore critical for preventing irreversible blindness. In this work, we propose an uncertainty-aware deep learning framework for automated DR severity grading that explicitly models the ordinal nature of disease progression. Our approach combines a convolutional backbone with lesion-query attention pooling and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty. The model is trained using an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. We evaluate the proposed method on a multi-domain training setup combining APTOS, Messidor-2, and a subset of EyePACS fundus datasets. Experimental results demonstrate strong cross-dataset generalization, achieving competitive classification accuracy and high quadratic weighted kappa on held-out test sets, while providing meaningful uncertainty estimates for low-confidence cases. These results suggest that ordinal evidential learning is a promising direction for robust and clinically reliable diabetic retinopathy grading.
Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.
Knee osteoarthritis (KOA) grading based on radiographic images is a critical yet challenging task due to subtle inter-grade differences, annotation uncertainty, and the inherently ordinal nature of disease progression. Conventional deep learning approaches typically formulate this problem as deterministic multi-class classification, ignoring both the continuous progression of degeneration and the uncertainty in expert annotations. In this work, we propose ClinNet, a novel trustworthy framework that addresses KOA grading as an evidential ordinal regression problem. The proposed method integrates three key components: (1) a Bilateral Asymmetry Encoder (BAE) that explicitly models medial-lateral structural discrepancies; (2) a Diagnostic Memory Bank that maintains class-wise prototypes to stabilize feature representations; and (3) an Evidential Ordinal Head based on the Normal-Inverse-Gamma (NIG) distribution to jointly estimate continuous KL grades and epistemic uncertainty. Extensive experiments demonstrate that ClinNet achieves a Quadratic Weighted Kappa of 0.892 and Accuracy of 0.768, statistically outperforming state-of-the-art baselines (p < 0.001). Crucially, we demonstrate that the model's uncertainty estimates successfully flag out-of-distribution samples and potential misdiagnoses, paving the way for safe clinical deployment.