We study the difficulties in learning that arise from robust and differentially private optimization. We first study convergence of gradient descent based adversarial training with differential privacy, taking a simple binary classification task on linearly separable data as an illustrative example. We compare the gap between adversarial and nominal risk in both private and non-private settings, showing that the data dimensionality dependent term introduced by private optimization compounds the difficulties of learning a robust model. After this, we discuss what parts of adversarial training and differential privacy hurt optimization, identifying that the size of adversarial perturbation and clipping norm in differential privacy both increase the curvature of the loss landscape, implying poorer generalization performance.
Advancements in deep generative models have made it possible to synthesize images, videos and audio signals that are difficult to distinguish from natural signals, creating opportunities for potential abuse of these capabilities. This motivates the problem of tracking the provenance of signals, i.e., being able to determine the original source of a signal. Watermarking the signal at the time of signal creation is a potential solution, but current techniques are brittle and watermark detection mechanisms can easily be bypassed by applying post-processing transformations (cropping images, shifting pitch in the audio etc.). In this paper, we introduce ReSWAT (Resilient Signal Watermarking via Adversarial Training), a framework for learning transformation-resilient watermark detectors that are able to detect a watermark even after a signal has been through several post-processing transformations. Our detection method can be applied to domains with continuous data representations such as images, videos or sound signals. Experiments on watermarking image and audio signals show that our method can reliably detect the provenance of a signal, even if it has been through several post-processing transformations, and improve upon related work in this setting. Furthermore, we show that for specific kinds of transformations (perturbations bounded in the L2 norm), we can even get formal guarantees on the ability of our model to detect the watermark. We provide qualitative examples of watermarked image and audio samples in https://drive.google.com/open?id=1-yZ0WIGNu2Iez7UpXBjtjVgZu3jJjFga.
The widespread adoption of encrypted communications (e.g., the TLS protocol, the Tor anonymity network) fixed several critical security flaws and shielded the end-users from adversaries intercepting their transmitted data. While these protocols are very effective in protecting the confidentiality of the users' data (e.g., credit card numbers), it has been shown that they are prone (to different degrees) to adversaries aiming to breach the users' privacy. Traffic fingerprinting attacks allow an adversary to infer the webpage or the website loaded by a user based only on patterns in the user's encrypted traffic. In fact, many recent works managed to achieve a very high classification accuracy under optimal conditions for the adversary. This paper revisits the optimality assumptions made by those works and discusses various additional parameters that should be considered when evaluating a fingerprinting model. We propose three realistic scenarios simulating non-optimal fingerprinting conditions where various factors could affect the adversary's performance or operation. We then introduce a novel adaptive fingerprinting adversary and experimentally evaluate its accuracy and operation. Our experiments show that adaptive adversaries can reliably uncover the webpage visited by a user among several thousand potential pages, even under considerable distributional shift (e.g., the webpage contents change significantly over time). Such adversaries could infer the products a user browses on shopping websites or log the browsing habits of state dissidents on online forums and encyclopedias. Our technique achieves ~90% accuracy in a top-15 setting where the model distinguishes the article visited out of 6,000 Wikipedia webpages, while the same model achieves ~80% accuracy in a dataset of 13,000 classes that were not included in the training set.
Federated Learning (FL) allows multiple participants to collaboratively train machine learning models by keeping their datasets local and exchanging model updates. Recent work has highlighted weaknesses related to robustness and privacy in FL, including backdoor, membership and property inference attacks. In this paper, we investigate whether and how Differential Privacy (DP) can be used to defend against attacks targeting both robustness and privacy in FL. To this end, we present a first-of-its-kind experimental evaluation of Local and Central Differential Privacy (LDP/CDP), assessing their feasibility and effectiveness. We show that both LDP and CDP do defend against backdoor attacks, with varying levels of protection and utility, and overall more effectively than non-DP defenses. They also mitigate white-box membership inference attacks, which our work is the first to show. Neither, however, defend against property inference attacks, prompting the need for further research in this space. Overall, our work also provides a re-usable measurement framework to quantify the trade-offs between robustness/privacy and utility in differentially private FL.
Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will provably overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to formally characterize how the training set size factors into the privacy risks exposed by training a robust model. Finally, we empirically show our findings hold on image classification benchmark datasets, such as CIFAR-10.
Randomized smoothing, a method to certify a classifier's decision on an input is invariant under adversarial noise, offers attractive advantages over other certification methods. It operates in a black-box and so certification is not constrained by the size of the classifier's architecture. Here, we extend the work of Li et al. \cite{li2018second}, studying how the choice of divergence between smoothing measures affects the final robustness guarantee, and how the choice of smoothing measure itself can lead to guarantees in differing threat models. To this end, we develop a method to certify robustness against any $\ell_p$ ($p\in\mathbb{N}_{>0}$) minimized adversarial perturbation. We then demonstrate a negative result, that randomized smoothing suffers from the curse of dimensionality; as $p$ increases, the effective radius around an input one can certify vanishes.
Machine learning models are vulnerable to adversarial perturbations, that when added to an input, can cause high confidence misclassifications. The adversarial learning research community has made remarkable progress in the understanding of the root causes of adversarial perturbations. However, most problems that one may consider important to solve for the deployment of machine learning in safety critical tasks involve high dimensional complex manifolds that are difficult to characterize and study. It is common to develop adversarially robust learning theory on simple problems, in the hope that insights will transfer to `real world datasets'. In this work, we discuss a setting where this approach fails. In particular, we show with a linear classifier, it is always possible to solve a binary classification problem on Gaussian data under arbitrary levels of adversarial corruption during training, and that this property is not observed with non-linear classifiers on the CIFAR-10 dataset.
Machine learning is data hungry; the more data a model has access to in training, the more likely it is to perform well at inference time. Distinct parties may want to combine their local data to gain the benefits of a model trained on a large corpus of data. We consider such a case: parties get access to the model trained on their joint data but do not see each others individual datasets. We show that one needs to be careful when using this multi-party model since a potentially malicious party can taint the model by providing contaminated data. We then show how adversarial training can defend against such attacks by preventing the model from learning trends specific to individual parties data, thereby also guaranteeing party-level membership privacy.
Since Biggio et al. (2013) and Szegedy et al. (2013) first drew attention to adversarial examples, there has been a flood of research into defending and attacking machine learning models. However, almost all proposed attacks assume white-box access to a model. In other words, the attacker is assumed to have perfect knowledge of the models weights and architecture. With this insider knowledge, a white-box attack can leverage gradient information to craft adversarial examples. Black-box attacks assume no knowledge of the model weights or architecture. These attacks craft adversarial examples using information only contained in the logits or hard classification label. Here, we assume the attacker can use the logits in order to find an adversarial example. Empirically, we show that 2-sided stochastic gradient estimation techniques are not sensitive to scaling parameters, and can be used to mount powerful black-box attacks requiring relatively few model queries.
Security-critical applications such as malware, fraud, or spam detection, require machine learning models that operate on examples from constrained discrete domains. In these settings, gradient-based attacks that rely on adding perturbations often fail to produce adversarial examples that meet the domain constraints, and thus are not effective. We introduce a graphical framework that (1) formalizes existing attacks in discrete domains, (2) efficiently produces valid adversarial examples with guarantees of minimal cost, and (3) can accommodate complex cost functions beyond the commonly used p-norm. We demonstrate the effectiveness of this method by crafting adversarial examples that evade a Twitter bot detection classifier using a provably minimal number of changes.