Flavio P. Calmon

Differentially Private Secure Multiplication: Hiding Information in the Rubble of Noise

Sep 28, 2023
Viveck R. Cadambe, Ateet Devulapalli, Haewon Jeong, Flavio P. Calmon

We consider the problem of private distributed multi-party multiplication. It is well established that Shamir secret-sharing coding strategies can enable perfect information-theoretic privacy in distributed computation via the celebrated algorithm of Ben-Or, Goldwasser, and Wigderson (the "BGW algorithm"). However, perfect privacy and accuracy require an honest majority, that is, $N \geq 2t+1$ compute nodes are required to ensure privacy against any $t$ colluding adversarial nodes. By allowing for a controlled amount of information leakage and approximate multiplication instead of exact multiplication, we study coding schemes for the setting where the honest nodes may be a minority, that is, $N < 2t+1$. We develop a tight characterization of the privacy-accuracy trade-off for the case $N < 2t+1$ by measuring information leakage using differential privacy instead of perfect privacy, and measuring accuracy using mean squared error. A novel technical aspect is an intricately layered noise distribution that merges ideas from differential privacy and Shamir secret-sharing at different layers.

* Extended version of papers presented at IEEE ISIT 2022, IEEE ISIT 2023, and TPDP 2023 
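
For context on the BGW-style setup above, here is a minimal sketch of plain Shamir secret sharing in Python: any $t$ shares reveal nothing about the secret, while any $t+1$ shares reconstruct it exactly. This is illustrative only; the paper's contribution is the layered noise distribution built on top of such shares, which this sketch does not implement, and the field size and parameter values are arbitrary choices.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; field size is an arbitrary choice for this sketch

def share(secret, t, n):
    """Split `secret` into n Shamir shares: any t shares reveal nothing,
    while any t + 1 shares reconstruct the secret exactly."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t)]
    # Node i receives the degree-t polynomial evaluated at x = i.
    return [(i, sum(c * pow(i, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

shares = share(42, t=2, n=5)           # N = 5 >= 2t + 1: the honest-majority regime
assert reconstruct(shares[:3]) == 42   # any t + 1 = 3 shares suffice
```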

Fair Machine Unlearning: Data Removal while Mitigating Disparities

Jul 27, 2023
Alex Oesterling, Jiaqi Ma, Flavio P. Calmon, Hima Lakkaraju

As public consciousness regarding the collection and use of personal information by corporations grows, it is increasingly important that consumers be active participants in the curation of corporate datasets. In light of this, data governance frameworks such as the General Data Protection Regulation (GDPR) have outlined the right to be forgotten as a key principle allowing individuals to request that their personal data be deleted from the databases and models used by organizations. To achieve forgetting in practice, several machine unlearning methods have been proposed to address the computational inefficiency of retraining a model from scratch with each unlearning request. While these methods are efficient online alternatives to retraining, it is unclear how they impact other properties critical to real-world applications, such as fairness. In this work, we propose the first fair machine unlearning method that can provably and efficiently unlearn data instances while preserving group fairness. We derive theoretical results demonstrating that our method can provably unlearn data instances while maintaining fairness objectives. Extensive experimentation with real-world datasets highlights the efficacy of our method at unlearning data instances while preserving fairness.

* 27 pages, 3 figures, accepted to ICML 2023 DMLR Workshop 
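
To make "preserving group fairness" concrete, below is a minimal sketch of one common group-fairness metric, the demographic parity gap; the metric choice and all names are illustrative choices for this note, not taken from the paper.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups;
    an unlearning step that preserves group fairness should keep this gap
    from growing as removal requests are processed."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, 1000)       # hypothetical binary predictions
groups = rng.integers(0, 2, 1000)      # hypothetical group membership
print(demographic_parity_gap(preds, groups))
```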

Arbitrariness Lies Beyond the Fairness-Accuracy Frontier

Jun 15, 2023
Carol Xuan Long, Hsiang Hsu, Wael Alghamdi, Flavio P. Calmon

Machine learning tasks may admit multiple competing models that achieve similar performance yet produce conflicting outputs for individual samples -- a phenomenon known as predictive multiplicity. We demonstrate that fairness interventions in machine learning optimized solely for group fairness and accuracy can exacerbate predictive multiplicity. Consequently, state-of-the-art fairness interventions can mask high predictive multiplicity behind favorable group fairness and accuracy metrics. We argue that a third axis of "arbitrariness" should be considered when deploying models to aid decision-making in applications of individual-level impact. To address this challenge, we propose an ensemble algorithm applicable to any fairness intervention that provably ensures more consistent predictions.
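
To make predictive multiplicity concrete, here is a small sketch that measures the fraction of samples on which competing models disagree and aggregates them by majority vote. This illustrates the general idea only; it is not the paper's multiplicity metric or its provably consistent ensemble algorithm.

```python
import numpy as np

def ambiguity(preds):
    """Fraction of samples on which at least two competing models disagree --
    a simple measure of predictive multiplicity."""
    preds = np.asarray(preds)                    # shape (n_models, n_samples)
    return (preds != preds[0]).any(axis=0).mean()

def majority_vote(preds):
    """Aggregating competing models damps per-sample arbitrariness."""
    return (np.asarray(preds).mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(1)
models = rng.integers(0, 2, size=(10, 500))      # 10 hypothetical binary classifiers
print(ambiguity(models))                         # disagreement rate across models
print(majority_vote(models)[:10])                # a single, more consistent prediction
```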


Adapting Fairness Interventions to Missing Values

May 30, 2023
Raymond Feng, Flavio P. Calmon, Hao Wang

Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which the data is first imputed and the imputed data is then used for classification -- a procedure referred to as "impute-then-classify" -- can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputation discards the missingness pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving the information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
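
The general idea of preserving the missingness pattern rather than discarding it can be sketched minimally as follows: keep an indicator column per feature alongside the imputed values, so a downstream fairness-intervention algorithm can still use that signal. This is an illustration of the principle, not the paper's adaptive algorithms.

```python
import numpy as np
import pandas as pd

def impute_with_pattern(df):
    """Mean-impute, but keep one indicator column per feature so the
    missingness pattern stays available to a downstream (fair) classifier."""
    indicators = df.isna().astype(int).add_suffix("_missing")
    imputed = df.fillna(df.mean())
    return pd.concat([imputed, indicators], axis=1)

X = pd.DataFrame({"income": [50.0, np.nan, 65.0], "age": [34.0, 41.0, np.nan]})
print(impute_with_pattern(X))   # imputed values plus income_missing / age_missing flags
```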


Gaussian Max-Value Entropy Search for Multi-Agent Bayesian Optimization

Mar 10, 2023
Haitong Ma, Tianpeng Zhang, Yixuan Wu, Flavio P. Calmon, Na Li

We study the multi-agent Bayesian optimization (BO) problem, where multiple agents maximize a black-box function via iterative queries. We focus on Entropy Search (ES), a sample-efficient BO algorithm that selects queries to maximize the mutual information about the maximum of the black-box function. One of the main challenges of ES is that calculating the mutual information requires computationally costly approximation techniques, and for multi-agent BO problems the computational cost of ES is exponential in the number of agents. To address this challenge, we propose Gaussian Max-value Entropy Search, a multi-agent BO algorithm with favorable sample and computational efficiency. The key idea is to use a normal distribution to approximate the function maximum and calculate its mutual information accordingly. The resulting approximation allows queries to be cast as the solution of a closed-form optimization problem which, in turn, can be solved via a modified gradient ascent algorithm and scaled to a large number of agents. We demonstrate the effectiveness of Gaussian Max-value Entropy Search through numerical experiments on standard test functions and real-robot experiments on the source-seeking problem. Results show that the proposed algorithm outperforms multi-agent BO baselines in the numerical experiments and can stably seek the source with a limited number of noisy observations on real robots.

* 10 pages, 9 figures 
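
The abstract's key idea -- approximating the unknown maximum with a normal distribution -- can be sketched as follows: sample candidate maxima from that normal and average the standard max-value entropy search information gain. This is a Monte Carlo illustration under assumed GP posterior values, not the paper's closed-form query optimization; all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def gaussian_mes(mu, sigma, mu_star, sigma_star, n_samples=16, seed=0):
    """Max-value entropy search acquisition where the unknown maximum f* is
    approximated by N(mu_star, sigma_star^2); averages the standard MES
    information-gain formula over Gaussian samples of f*."""
    rng = np.random.default_rng(seed)
    y_star = rng.normal(mu_star, sigma_star, n_samples)       # sampled maxima
    gamma = (y_star[:, None] - mu[None, :]) / sigma[None, :]  # standardized gap
    # Entropy reduction from truncating the predictive Gaussian at f*.
    return (gamma * norm.pdf(gamma) / (2 * norm.cdf(gamma))
            - np.log(norm.cdf(gamma))).mean(axis=0)

mu = np.array([0.1, 0.5, 0.9])      # GP posterior mean at 3 candidate queries
sigma = np.array([0.3, 0.2, 0.4])   # GP posterior standard deviation
print(gaussian_mes(mu, sigma, mu_star=1.2, sigma_star=0.1))  # pick the argmax query
```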

Arbitrary Decisions are a Hidden Cost of Differentially-Private Training

Feb 28, 2023
Bogdan Kulynych, Hsiang Hsu, Carmela Troncoso, Flavio P. Calmon

Mechanisms used in privacy-preserving machine learning often aim to guarantee differential privacy (DP) during model training. Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data (e.g., adding Gaussian noise to clipped gradients). We demonstrate that such randomization incurs predictive multiplicity: for a given input example, the output predicted by equally-private models depends on the randomness used in training. Thus, for a given input, the predicted output can vary drastically if a model is re-trained, even if the same training dataset is used. The predictive-multiplicity cost of DP training has not been studied, and is currently neither audited for nor communicated to model designers and stakeholders. We derive a bound on the number of re-trainings required to estimate predictive multiplicity reliably. We analyze -- both theoretically and through extensive experiments -- the predictive-multiplicity cost of three DP-ensuring algorithms: output perturbation, objective perturbation, and DP-SGD. We demonstrate that the degree of predictive multiplicity rises as the level of privacy increases, and is unevenly distributed across individuals and demographic groups in the data. Because randomness used to ensure DP during training explains predictions for some examples, our results highlight a fundamental challenge to the justifiability of decisions supported by differentially-private models in high-stakes settings. We conclude that practitioners should audit the predictive multiplicity of their DP-ensuring algorithms before deploying them in applications of individual-level consequence.
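
A minimal sketch of the phenomenon for the simplest of the three mechanisms, output perturbation: the same non-private weights are released repeatedly with fresh Gaussian noise, and we count how many inputs change predicted label across releases. Noise calibration is omitted and all parameter values are illustrative.

```python
import numpy as np

def output_perturbation(theta, sigma, rng):
    """Output perturbation: release trained weights plus Gaussian noise
    (the DP calibration of sigma is omitted in this sketch)."""
    return theta + rng.normal(0.0, sigma, size=theta.shape)

def disagreement_rate(X, theta, sigma, n_retrain=50, seed=0):
    """Re-run the private release many times and measure how often the
    predicted label for each input flips -- the predictive-multiplicity
    cost of the randomness used for DP."""
    rng = np.random.default_rng(seed)
    preds = np.stack([(X @ output_perturbation(theta, sigma, rng)) > 0
                      for _ in range(n_retrain)])
    return (preds != preds[0]).any(axis=0).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                  # hypothetical inputs
theta = rng.normal(size=5)                     # non-private trained linear model
print(disagreement_rate(X, theta, sigma=0.5))  # more noise -> more label flips
```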


Aleatoric and Epistemic Discrimination in Classification

Jan 27, 2023
Hao Wang, Luxi He, Rui Gao, Flavio P. Calmon

Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.
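
As a toy instantiation of the decomposition (not the paper's Blackwell-based characterization, and using an empirical sample in place of perfect distributional knowledge), the sketch below brute-forces the best accuracy attainable by group-wise thresholds under a demographic parity constraint, then reads the epistemic gap off a deployed model.

```python
import numpy as np

def fairness_constrained_limit(scores, labels, groups, tol=0.02):
    """Toy stand-in for the aleatoric limit: best accuracy of group-wise
    thresholds whose positive-prediction rates differ by at most `tol`."""
    grid = np.quantile(scores, np.linspace(0.05, 0.95, 19))
    best = 0.0
    for t0 in grid:
        for t1 in grid:
            pred = np.where(groups == 0, scores > t0, scores > t1)
            gap = abs(pred[groups == 0].mean() - pred[groups == 1].mean())
            if gap <= tol:
                best = max(best, (pred == labels).mean())
    return best

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 2000)
labels = rng.integers(0, 2, 2000)
scores = labels + rng.normal(0, 1, 2000)       # hypothetical model scores
limit = fairness_constrained_limit(scores, labels, groups)
deployed_acc = ((scores > 0.5) == labels).mean()
# Epistemic discrimination = aleatoric limit minus the deployed model's accuracy.
print(limit, max(limit - deployed_acc, 0.0))
```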


Automated Segmentation and Recurrence Risk Prediction of Surgically Resected Lung Tumors with Adaptive Convolutional Neural Networks

Sep 17, 2022
Marguerite B. Basta, Sarfaraz Hussein, Hsiang Hsu, Flavio P. Calmon

Lung cancer is the leading cause of cancer-related mortality by a significant margin. While new technologies, such as image segmentation, have been paramount to improved detection and earlier diagnoses, there are still significant challenges in treating the disease. In particular, despite an increased number of curative resections, many postoperative patients still develop recurrent lesions. Consequently, there is a significant need for prognostic tools that can more accurately predict a patient's risk of recurrence. In this paper, we explore the use of convolutional neural networks (CNNs) for the segmentation and recurrence risk prediction of lung tumors present in preoperative computed tomography (CT) images. First, expanding upon recent progress in medical image segmentation, a residual U-Net is used to localize and characterize each nodule. Then, the identified tumors are passed to a second CNN for recurrence risk prediction. The system's final results are produced by a random forest classifier that synthesizes the predictions of the second network with clinical attributes. The segmentation stage uses the LIDC-IDRI dataset and achieves a Dice score of 70.3%. The recurrence risk stage uses the NLST dataset from the National Cancer Institute and achieves an AUC of 73.0%. Our proposed framework demonstrates, first, that automated nodule segmentation methods can generalize to enable pipelines for a wide range of multitask systems and, second, that deep learning and image processing have the potential to improve current prognostic tools. To the best of our knowledge, this is the first fully automated segmentation and recurrence risk prediction system.

* 9 pages, 5 figures 
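
For reference, the segmentation metric quoted above is the Dice coefficient, $2|A \cap B| / (|A| + |B|)$ over binary masks; a minimal sketch:

```python
import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient over binary tumor masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, true = np.asarray(pred_mask, bool), np.asarray(true_mask, bool)
    intersection = np.logical_and(pred, true).sum()
    return 2.0 * intersection / (pred.sum() + true.sum() + eps)

pred = np.array([[0, 1, 1], [0, 1, 0]])   # hypothetical predicted nodule mask
true = np.array([[0, 1, 0], [0, 1, 1]])   # hypothetical ground-truth mask
print(dice_score(pred, true))             # ~0.667 for these toy masks
```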

The Saddle-Point Accountant for Differential Privacy

Aug 20, 2022
Wael Alghamdi, Shahab Asoodeh, Flavio P. Calmon, Juan Felipe Gomez, Oliver Kosut, Lalitha Sankar, Fei Wei

We introduce a new differential privacy (DP) accountant called the saddle-point accountant (SPA). SPA approximates privacy guarantees for the composition of DP mechanisms in an accurate and fast manner. Our approach is inspired by the saddle-point method -- a ubiquitous numerical technique in statistics. We prove rigorous performance guarantees by deriving upper and lower bounds for the approximation error offered by SPA. The crux of SPA is a combination of large-deviation methods with central limit theorems, which we derive via exponentially tilting the privacy loss random variables corresponding to the DP mechanisms. One key advantage of SPA is that it runs in constant time for the $n$-fold composition of a privacy mechanism. Numerical experiments demonstrate that SPA achieves comparable accuracy to state-of-the-art accounting methods with a faster runtime.

* 31 pages, 4 figures 
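
To illustrate the core numerical idea -- not the SPA's actual estimator of the privacy curve, which targets the delta(epsilon) functional itself with proven error bounds -- here is a first-order saddle-point tail approximation for the $n$-fold composed privacy loss of the Gaussian mechanism, whose single-query privacy loss is $N(\mu, v)$ with $\mu = 1/(2\sigma^2)$ and $v = 1/\sigma^2$.

```python
import numpy as np

def saddle_point_tail(eps, n, sigma):
    """First-order saddle-point approximation to P(sum of n Gaussian-mechanism
    privacy-loss RVs >= eps). Illustrative only: the paper's SPA approximates
    the DP guarantee delta(eps) itself, with upper and lower error bounds."""
    mu, v = 1 / (2 * sigma**2), 1 / sigma**2
    K = lambda s: s * mu + 0.5 * s**2 * v        # CGF of one privacy-loss RV
    s_star = (eps - n * mu) / (n * v)            # solves n * K'(s) = eps
    if s_star <= 0:                              # tail approximation needs eps > n * mu
        raise ValueError("eps must exceed the mean n * mu")
    # exp(n K(s*) - s* eps) / (s* sqrt(2 pi n K''(s*))), with K''(s) = v here.
    return np.exp(n * K(s_star) - s_star * eps) / (s_star * np.sqrt(2 * np.pi * n * v))

print(saddle_point_tail(eps=4.0, n=100, sigma=5.0))  # O(1) cost, independent of n
```

Note the constant-time behavior the abstract highlights: the cumulant generating function of the $n$-fold composition is just $n$ times that of a single mechanism, so the runtime does not grow with $n$.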

Bottlenecks CLUB: Unifying Information-Theoretic Trade-offs Among Complexity, Leakage, and Utility

Jul 11, 2022
Behrooz Razeghi, Flavio P. Calmon, Deniz Gunduz, Slava Voloshynovskiy

Bottleneck problems are an important class of optimization problems that have recently gained increasing attention in machine learning and information theory. They are widely used in generative models, fair machine learning algorithms, and the design of privacy-assuring mechanisms, and they appear as information-theoretic performance bounds in various multi-user communication problems. In this work, we propose a general family of optimization problems, termed the complexity-leakage-utility bottleneck (CLUB) model, which (i) provides a unified theoretical framework that generalizes most of the state-of-the-art literature on information-theoretic privacy models, (ii) establishes a new interpretation of popular generative and discriminative models, (iii) provides new insights into generative compression models, and (iv) can be used in fair generative models. We first formulate the CLUB model as a complexity-constrained privacy-utility optimization problem. We then connect it with the closely related bottleneck problems, namely the information bottleneck (IB), privacy funnel (PF), deterministic IB (DIB), conditional entropy bottleneck (CEB), and conditional PF (CPF). We show that the CLUB model generalizes all these problems as well as most other information-theoretic privacy models. We then construct the deep variational CLUB (DVCLUB) models by employing neural networks to parameterize variational approximations of the associated information quantities. Building upon these information quantities, we present unified objectives for the supervised and unsupervised DVCLUB models. Leveraging the DVCLUB model in an unsupervised setup, we connect it with state-of-the-art generative models, such as variational auto-encoders (VAEs) and generative adversarial networks (GANs), as well as the Wasserstein GAN (WGAN), Wasserstein auto-encoder (WAE), and adversarial auto-encoder (AAE) models, through the optimal transport (OT) problem. We then show that the DVCLUB model can also be used in fair representation learning problems, where the goal is to mitigate undesired bias during the training of a machine learning model. We conduct extensive quantitative experiments on colored-MNIST and CelebA, with a public implementation available, to evaluate and analyze the CLUB model.
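
A hedged sketch of how a CLUB-style Lagrangian might be wired up with standard variational bounds (PyTorch; every name, bound choice, and the adversarial leakage proxy are illustrative assumptions, not the paper's exact DVCLUB objective):

```python
import torch
import torch.nn.functional as F

def club_style_loss(z_mu, z_logvar, y_logits, y, s_logits, s,
                    alpha=1.0, beta=1.0):
    """Lagrangian trading off the three CLUB quantities via standard bounds:
      complexity: I(X;Z) <= E[KL(q(z|x) || N(0, I))]   (closed form below)
      utility:    I(Z;Y) bounded via a decoder's cross-entropy
      leakage:    I(S;Z) proxied by an adversary predicting S from Z
                  (the adversary's own inner training loop is omitted here)."""
    complexity = -0.5 * (1 + z_logvar - z_mu.pow(2) - z_logvar.exp()).sum(1).mean()
    utility = F.cross_entropy(y_logits, y)     # minimize: preserves the I(Z;Y) bound
    leakage = -F.cross_entropy(s_logits, s)    # minimize: makes S hard to infer from Z
    return utility + alpha * complexity + beta * leakage

# Shapes only; real encoder/decoder/adversary networks would produce these tensors.
B, d, n_y, n_s = 32, 16, 10, 2
loss = club_style_loss(torch.randn(B, d), torch.randn(B, d),
                       torch.randn(B, n_y), torch.randint(0, n_y, (B,)),
                       torch.randn(B, n_s), torch.randint(0, n_s, (B,)))
print(loss.item())
```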
