Abstract:Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, and typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via the Token-Patch Cross-Attention(TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks-Brain, Retina, Lung, and Breast datasets-demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP
Abstract:Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
Abstract:Fairness in artificial intelligence and machine learning (AI/ML) models is becoming critically important, especially as decisions made by these systems impact diverse groups. In education, a vital sector for all countries, the widespread application of AI/ML systems raises specific concerns regarding fairness. Current research predominantly focuses on fairness for individual sensitive features, which limits the comprehensiveness of fairness assessments. This paper introduces FAIREDU, a novel and effective method designed to improve fairness across multiple sensitive features. Through extensive experiments, we evaluate FAIREDU effectiveness in enhancing fairness without compromising model performance. The results demonstrate that FAIREDU addresses intersectionality across features such as gender, race, age, and other sensitive features, outperforming state-of-the-art methods with minimal effect on model accuracy. The paper also explores potential future research directions to enhance further the method robustness and applicability to various machine-learning models and datasets.