Abstract: Despite the constant development of new bias mitigation methods for machine learning, no method consistently succeeds, and a fundamental question remains unanswered: when and why do bias mitigation techniques fail? In this paper, we hypothesise that a key factor may be the often-overlooked but crucial step shared by many bias mitigation methods: the definition of subgroups. To investigate this, we conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple vision and language classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our results reveal that subgroup choice significantly impacts performance, with certain groupings paradoxically leading to worse outcomes than no mitigation at all. Our findings suggest that observing a disparity between a set of subgroups is not a sufficient reason to use those subgroups for mitigation. Through theoretical analysis, we explain these phenomena and uncover a counter-intuitive insight that, in some cases, improving fairness with respect to a particular set of subgroups is best achieved by using a different set of subgroups for mitigation. Our work highlights the importance of careful subgroup definition in bias mitigation and presents it as an alternative lever for improving the robustness and fairness of machine learning models.
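As a rough illustration of what varying the subgroup definition can mean in practice, the sketch below builds coarse, fine-grained, intersectional, and noisy group labels from sample metadata before they are passed to a group-based mitigation method such as group DRO. The column names (sex, age), bin edges, and noise model are illustrative assumptions, not the attributes or code used in the paper.

\begin{verbatim}
# Illustrative sketch (not the paper's code): constructing alternative
# subgroup definitions from tabular metadata. Column names are assumed.
import numpy as np
import pandas as pd

def make_subgroups(meta: pd.DataFrame, scheme: str,
                   noise_rate: float = 0.0, seed: int = 0) -> np.ndarray:
    """Return one integer subgroup label per sample under a given scheme."""
    age_bins = [0, 40, 60, 80, 120]                       # assumed bin edges
    if scheme == "coarse":                                # e.g. sex only
        groups = meta["sex"].astype("category").cat.codes.to_numpy()
    elif scheme == "fine":                                # e.g. binned age
        groups = pd.cut(meta["age"], bins=age_bins,
                        labels=False).to_numpy().astype(int)
    elif scheme == "intersectional":                      # sex x age-bin
        sex = meta["sex"].astype("category").cat.codes.to_numpy()
        age = pd.cut(meta["age"], bins=age_bins,
                     labels=False).to_numpy().astype(int)
        groups = sex * (len(age_bins) - 1) + age
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    if noise_rate > 0:                                    # noisy subgroup labels
        rng = np.random.default_rng(seed)
        flip = rng.random(len(groups)) < noise_rate
        random_labels = rng.integers(0, groups.max() + 1, len(groups))
        groups = np.where(flip, random_labels, groups)
    return groups
\end{verbatim}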
Abstract: Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite the model's strong overall performance. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these disparities emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.
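For concreteness, a minimal sketch of the kind of per-group audit described above: computing a per-group AUC and the gap to the best-performing group, for example with assessment centre as the grouping variable. The column names and the choice of AUC are assumptions for illustration, not the study's actual evaluation code.

\begin{verbatim}
# Illustrative sketch (assumed inputs): per-group performance audit of a
# trained classifier, e.g. grouping predictions by assessment centre.
import pandas as pd
from sklearn.metrics import roc_auc_score

def group_disparity(y_true, y_score, group_labels) -> pd.DataFrame:
    """Per-group AUC plus the gap to the best-performing group."""
    df = pd.DataFrame({"y": y_true, "s": y_score, "g": group_labels})
    rows = []
    for g, sub in df.groupby("g"):
        if sub["y"].nunique() < 2:      # AUC undefined for a single class
            continue
        rows.append({"group": g, "n": len(sub),
                     "auc": roc_auc_score(sub["y"], sub["s"])})
    out = pd.DataFrame(rows)
    out["gap_to_best"] = out["auc"].max() - out["auc"]
    return out.sort_values("auc")
\end{verbatim}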
Abstract: Fluorodeoxyglucose Positron Emission Tomography (FDG-PET) combined with Computed Tomography (CT) is critical in oncology for identifying solid tumours and monitoring their progression. However, precise and consistent lesion segmentation remains challenging, as manual segmentation is time-consuming and subject to intra- and inter-observer variability. Despite their promise, automated segmentation methods often struggle with false positive segmentation of regions of healthy metabolic activity, particularly when presented with the complex range of tumours that can occur across the whole body. In this paper, we explore the application of nnUNet to tumour segmentation in whole-body PET-CT scans and conduct experiments on optimal training and post-processing strategies. Our best model obtains a Dice score of 69\% and false negative and false positive volumes of 6.27 mL and 5.78 mL respectively, on our internal test set. This model is submitted as part of the autoPET 2023 challenge. Our code is available at: https://github.com/anissa218/autopet\_nnunet
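For reference, a minimal sketch of how the reported metrics can be computed from binary lesion masks. Note that it counts every mismatched voxel, whereas the official autoPET evaluation defines false positive and false negative volumes per connected component, and the voxel spacing shown is an assumed placeholder rather than the dataset's actual resolution.

\begin{verbatim}
# Minimal sketch (not the challenge's official evaluation code): Dice score
# and false negative / false positive volumes for binary lesion masks,
# assuming voxel spacing in mm so volumes come out in mL (1 mL = 1000 mm^3).
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray,
                         spacing_mm=(2.0, 2.0, 3.0)) -> dict:
    pred, gt = pred.astype(bool), gt.astype(bool)
    voxel_ml = np.prod(spacing_mm) / 1000.0      # mm^3 per voxel -> mL
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    return {"dice": dice,
            "fp_volume_ml": fp * voxel_ml,
            "fn_volume_ml": fn * voxel_ml}
\end{verbatim}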