D. Sculley

Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

May 22, 2023
Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Max Bartolo, Oana Inel, Juan Ciro, Rafael Mosquera, Addison Howard, Will Cukierski, D. Sculley, Vijay Janapa Reddi, Lora Aroyo

The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capability to generate realistic and creative content, T2I models such as DALL-E, Midjourney, Imagen, and Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pre-training on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example through generated images that are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets for scrutinizing model behavior, especially for adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving wide coverage of different types of challenging examples in the evaluation set, i.e., identifying 'unknown unknowns' or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to raise awareness of these issues and to help developers improve the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.


Plex: Towards Reliability using Pretrained Large Model Extensions

Jul 15, 2022
Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, Balaji Lakshminarayanan

A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also exhibit puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs consistently well over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across reliability tasks and simplifies the traditional protocol, as it improves out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.

* Code available at https://goo.gle/plex-code 
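Selective prediction, one of the reliability tasks named above, can be evaluated in a few lines of code: the model abstains on its least-confident examples and accuracy is measured only on those it keeps. The sketch below is a generic illustration with made-up predictive distributions, not Plex's evaluation code.

```python
import numpy as np

def selective_accuracy(probs, labels, coverage=0.8):
    """Accuracy on the `coverage` fraction of examples where the model is most
    confident (highest max predicted probability); the rest are abstained on."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    n_keep = int(np.ceil(coverage * len(labels)))
    keep = np.argsort(-confidence)[:n_keep]  # most confident examples first
    return (predictions[keep] == labels[keep]).mean()

# Hypothetical predictive distributions for four examples over three classes.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.80, 0.10],
                  [0.34, 0.33, 0.33]])
labels = np.array([0, 2, 1, 0])
print(selective_accuracy(probs, labels, coverage=0.5))  # accuracy on the two kept examples
```

A model with well-ranked confidence should see this selective accuracy rise as coverage is lowered, since the abstained examples are exactly the ones it is least sure about.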

Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Jun 07, 2021
Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jasper Snoek, Yarin Gal, Dustin Tran

High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning, which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet competitive comparisons of methods are often lacking for a range of reasons, including the compute required for extensive tuning, the incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally, we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code is available at https://github.com/google/uncertainty-baselines.
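As an example of the kind of uncertainty metric such baselines report, the sketch below computes expected calibration error (ECE) from predicted class probabilities. It is a standalone, generic implementation for illustration, with made-up predictions, and is not the repository's own API.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the absolute gap between
    per-bin accuracy and per-bin mean confidence, weighted by bin size."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Hypothetical predictions: confident and wrong on the second example.
probs = np.array([[0.95, 0.05], [0.90, 0.10], [0.60, 0.40]])
labels = np.array([0, 1, 0])
print(round(expected_calibration_error(probs, labels, n_bins=5), 3))
```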


Underspecification Presents Challenges for Credibility in Modern Machine Learning

Nov 06, 2020
Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
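The phenomenon can be reproduced at toy scale: train several predictors that differ only in random seed, confirm they are essentially tied on held-out in-distribution data, then compare them under a shift the training domain never exhibited. The sketch below uses scikit-learn and synthetic data purely for illustration; it is not the paper's experimental setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_data(n, feature_scale):
    X = rng.normal(size=(n, 2)) * np.array([1.0, feature_scale])
    y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)  # label depends mostly on feature 0
    return X, y

X_train, y_train = make_data(2000, feature_scale=1.0)
X_val, y_val = make_data(500, feature_scale=1.0)       # in-distribution held-out set
X_shift, y_shift = make_data(500, feature_scale=10.0)  # feature 1's scale inflated at deployment

for seed in range(3):  # identical pipeline, different random seed
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=seed)
    model.fit(X_train, y_train)
    print(seed,
          round(model.score(X_val, y_val), 3),      # near-identical across seeds
          round(model.score(X_shift, y_shift), 3))  # can diverge across seeds
```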


Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift

Jul 17, 2020
Zachary Nado, Shreyas Padhy, D. Sculley, Alexander D'Amour, Balaji Lakshminarayanan, Jasper Snoek

Covariate shift has been shown to sharply degrade both the predictive accuracy and the calibration of uncertainty estimates for deep learning models. This is worrying, because covariate shift is prevalent in a wide range of real-world deployment settings. However, in this paper, we note that it is frequently possible to access small unlabeled batches of the shifted data just before prediction time. This observation enables a simple but surprisingly effective method, which we call prediction-time batch normalization, that significantly improves model accuracy and calibration under covariate shift. Using this one-line code change, we achieve state-of-the-art results on recent covariate shift benchmarks and an mCE of 60.28% on the challenging ImageNet-C dataset; to our knowledge, this is the best result for any model that does not incorporate additional data augmentation or modification of the training pipeline. We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness (e.g., deep ensembles), and that combining the two further improves performance. Our findings are supported by detailed measurements of the effect of this strategy on model behavior across rigorous ablations on various dataset modalities. However, the method has mixed results when used alongside pre-training and does not seem to perform as well under more natural types of dataset shift; it is therefore worthy of additional study. We include links to the data in our figures to improve reproducibility, including a Python notebook that can be run to easily modify our analysis at https://colab.research.google.com/drive/11N0wDZnMQQuLrRwRoumDCrhSaIhkqjof.
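The "one-line code change" amounts to normalizing with the statistics of the shifted prediction-time batch instead of the running averages accumulated during training. The sketch below shows one way to express that idea in PyTorch; it is an illustrative reimplementation under that reading, not the authors' code, and the toy model and batch are placeholders.

```python
import torch
import torch.nn as nn

def predict_with_batch_stats(model, shifted_batch):
    """Prediction-time batch normalization: put only the BatchNorm layers into
    training mode so they normalize with the statistics of `shifted_batch`
    itself rather than the running averages from the training distribution."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train()  # use the current batch's mean/variance for normalization
    with torch.no_grad():
        return model(shifted_batch)

# Hypothetical small network and a batch of 32 shifted 32x32 RGB images.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
logits = predict_with_batch_stats(model, torch.randn(32, 3, 32, 32))
print(logits.shape)  # torch.Size([32, 10])
```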


TensorFlow.js: Machine Learning for the Web and Beyond

Jan 16, 2019
Daniel Smilkov, Nikhil Thorat, Yannick Assogba, Ann Yuan, Nick Kreeger, Ping Yu, Kangyi Zhang, Shanqing Cai, Eric Nielsen, David Soergel, Stan Bileschi, Michael Terry, Charles Nicholson, Sandeep N. Gupta, Sarah Sirajuddin, D. Sculley, Rajat Monga, Greg Corrado, Fernanda B. Viegas, Martin Wattenberg

TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases.

* 10 pages 

BriarPatches: Pixel-Space Interventions for Inducing Demographic Parity

Dec 17, 2018
Alexey A. Gritsenko, Alex D'Amour, James Atwood, Yoni Halpern, D. Sculley

We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers. The patches encourage internal model representations not to encode sensitive information, which has the effect of pushing downstream predictors towards exhibiting demographic parity with respect to the sensitive information. The net result is that BriarPatches provide an intervention mechanism available at the user level, complementing prior research on fair representations that was previously applicable only to model developers and ML experts.

* 6 pages, 5 figures, NeurIPS Workshop on Ethical, Social and Governance Issues in AI 
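The abstract does not spell out how a BriarPatch is produced, so the sketch below is only one plausible reading: optimize a small pixel patch so that a frozen sensitive-attribute classifier becomes maximally uncertain on patched images. The objective, the corner placement, and the `attribute_model` and `images` inputs are all assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def fit_patch(images, attribute_model, patch_size=8, steps=200, lr=0.05):
    """Illustrative sketch: learn a patch that maximizes the entropy of a frozen
    sensitive-attribute classifier on patched images (assumed objective)."""
    for p in attribute_model.parameters():
        p.requires_grad_(False)  # keep the attribute classifier frozen
    patch = torch.zeros(1, images.shape[1], patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        patched = images.clone()
        patched[:, :, :patch_size, :patch_size] = torch.tanh(patch)  # overlay in one corner
        probs = F.softmax(attribute_model(patched), dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        (-entropy).backward()  # maximize entropy of the attribute prediction
        optimizer.step()
        optimizer.zero_grad()
    return torch.tanh(patch).detach()
```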

Predicting Electron-Ionization Mass Spectrometry using Neural Networks

Nov 21, 2018
Jennifer N. Wei, David Belanger, Ryan P. Adams, D. Sculley

When confronted with a substance of unknown identity, researchers often perform mass spectrometry on the sample and compare the observed spectrum to a library of previously-collected spectra to identify the molecule. While popular, this approach will fail to identify molecules that are not in the existing library. In response, we propose to improve the library's coverage by augmenting it with synthetic spectra that are predicted using machine learning. We contribute a lightweight neural network model that quickly predicts mass spectra for small molecules. Achieving high accuracy predictions requires a novel neural network architecture that is designed to capture typical fragmentation patterns from electron ionization. We analyze the effects of our modeling innovations on library matching performance and compare our models to prior machine learning-based work on spectrum prediction.

* 12 pages, 5 figures, accepted to Machine Learning for Molecules and Materials Workshop at NeurIPS 2018 
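For context on the library-matching task used for evaluation, the sketch below scores a query spectrum against library spectra by cosine similarity over intensities binned by mass-to-charge ratio. It is a generic illustration of spectral library search with made-up spectra, not the paper's matching pipeline or model.

```python
import numpy as np

def best_library_match(query, library):
    """Return the index of the library spectrum most similar to the query,
    plus all similarity scores. Each spectrum is a fixed-length vector of
    intensities binned by m/z."""
    def unit(spectrum):
        spectrum = np.asarray(spectrum, dtype=float)
        norm = np.linalg.norm(spectrum)
        return spectrum / norm if norm > 0 else spectrum
    q = unit(query)
    scores = [float(unit(s) @ q) for s in library]
    return int(np.argmax(scores)), scores

# Hypothetical 5-bin spectra; the query is closest to the second library entry.
library = [[0.0, 1.0, 0.0, 0.2, 0.0],
           [0.5, 0.0, 0.9, 0.0, 0.1],
           [0.0, 0.0, 0.0, 1.0, 0.3]]
best, scores = best_library_match([0.4, 0.0, 1.0, 0.0, 0.0], library)
print(best, [round(s, 3) for s in scores])  # index 1 scores highest
```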

No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World

Nov 22, 2017
Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, D. Sculley

Modern machine learning systems such as image classifiers rely heavily on large-scale data sets for training. Such data sets are costly to create, so in practice a small number of freely available, open source data sets are widely used. We suggest that examining the geo-diversity of open data sets is critical before adopting a data set for use cases in the developing world. We analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets exhibit an observable Amerocentric and Eurocentric representation bias. Further, we analyze classifiers trained on these data sets to assess the impact of these training distributions, and find strong differences in relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world.

* Presented at NIPS 2017 Workshop on Machine Learning for the Developing World 
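An audit like the one described above can begin with nothing more than per-image country metadata: compute the share of images per country of origin and compare it with the locales where the model will be deployed. The sketch below is an illustrative example with made-up labels, not the paper's analysis code.

```python
from collections import Counter

def geo_distribution(country_labels):
    """Fraction of images per country of origin, sorted from largest share to smallest."""
    counts = Counter(country_labels)
    total = sum(counts.values())
    return sorted(((country, n / total) for country, n in counts.items()),
                  key=lambda item: item[1], reverse=True)

# Hypothetical metadata for a small sample of 100 images.
labels = ["US"] * 60 + ["GB"] * 20 + ["BR"] * 12 + ["IN"] * 5 + ["NG"] * 3
for country, share in geo_distribution(labels):
    print(f"{country}: {share:.0%}")
```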