Abstract:Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to four orders of magnitude. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
Abstract:We show that a modification of the first layer of a Graph Convolutional Network (GCN) can be used to effectively propagate label information across neighbor nodes, for binary and multi-class classification problems. This is done by selectively eliminating self-loops for the label features during the training phase of a GCN. The GCN architecture is otherwise unchanged, without any extra hyper-parameters, and can be used in both a transductive and inductive setting. We show through several experiments that, depending on how many labels are available during the inference phase, this strategy can lead to a substantial improvement in the model performance compared to a standard GCN approach, including with imbalanced datasets.