Abstract:Microcalcification (MC) analysis is clinically important in screening mammography because clustered puncta can be an early sign of malignancy, yet dense MC segmentation remains challenging: targets are extremely small and sparse, dense pixel-level labels are expensive and ambiguous, and cross-site shift often induces texture-driven false positives and missed puncta in dense tissue. We propose MC-GenRef, a real dense-label-free framework that combines high-fidelity synthetic supervision with test-time generative posterior refinement (TT-GPR). During training, real negative mammogram patches are used as backgrounds, and physically plausible MC patterns are injected through a lightweight image formation model with local contrast modulation and blur, yielding exact image-mask pairs without real dense annotation. Using only these synthetic labeled pairs, MC-GenRef trains a base segmentor and a seed-conditioned rectified-flow (RF) generator that serves as a controllable generative prior. During inference, TT-GPR treats segmentation as approximate posterior inference: it derives a sparse seed from the current prediction, forms seed-consistent RF projections, converts them into case-specific surrogate targets through the frozen segmentor, and iteratively refines the logits with overlap-consistent and edge-aware regularization. On INbreast, the synthetic-only initializer achieved the best Dice without real dense annotations, while TT-GPR improved miss-sensitive performance to Recall and FNR, with strong class-balanced behavior (Bal.Acc., G-Mean). On an external private Yonsei cohort ( n=50 ), TT-GPR consistently improved the synthetic-only initializer under cross-site shift, increasing Dice and Recall while reducing FNR. These results suggest that test-time generative posterior refinement is a practical route to reduce MC misses and improve robustness without additional real dense labeling.




Abstract:This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as pronounced rather than as existing words. The model was fine-tuned with a speech dataset from 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields.