Renkun Ni

K-SAM: Sharpness-Aware Minimization at the Speed of SGD

Oct 23, 2022
Renkun Ni, Ping-yeh Chiang, Jonas Geiping, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

Sharpness-Aware Minimization (SAM) has recently emerged as a robust technique for improving the accuracy of deep neural networks. However, SAM incurs a high computational cost in practice, requiring up to twice as much computation as vanilla SGD. The computational challenge posed by SAM arises because each iteration requires both ascent and descent steps and thus double the gradient computations. To address this challenge, we propose to compute gradients in both stages of SAM on only the top-k samples with the highest loss. K-SAM is simple and extremely easy to implement while providing significant generalization boosts over vanilla SGD at little to no additional cost.

* 13 pages, 2 figures 
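
As a rough, hypothetical sketch of the idea above (not the authors' reference implementation): a single K-SAM training step in PyTorch, where both the ascent and the descent gradients of SAM are computed only on the k highest-loss samples of the mini-batch. The names ksam_step, rho, and k, and the extra no-grad forward pass used for ranking, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ksam_step(model, optimizer, images, labels, rho=0.05, k=32):
    params = [p for p in model.parameters() if p.requires_grad]

    # Rank the mini-batch by per-sample loss and keep only the k hardest examples.
    with torch.no_grad():
        per_sample = F.cross_entropy(model(images), labels, reduction="none")
    idx = per_sample.topk(min(k, per_sample.numel())).indices
    x_k, y_k = images[idx], labels[idx]

    # SAM ascent step, computed on the top-k subset only.
    optimizer.zero_grad()
    F.cross_entropy(model(x_k), y_k).backward()
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm() for p in params if p.grad is not None]))
        eps = {p: rho * p.grad / (grad_norm + 1e-12)
               for p in params if p.grad is not None}
        for p, e in eps.items():
            p.add_(e)                      # climb to a nearby high-loss point

    # SAM descent step, also on the top-k subset, at the perturbed weights.
    optimizer.zero_grad()
    F.cross_entropy(model(x_k), y_k).backward()
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                      # restore the original weights
    optimizer.step()
```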

GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

Feb 16, 2021
Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W. Ronny Huang, Tom Goldstein

Changes in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often require re-thinking the choice of hyperparameters (e.g., learning rate, warmup schedule, and momentum coefficients) to maintain stability of the optimizer. This optimizer instability is often the result of poor parameter initialization, and can be avoided by architecture-specific initialization schemes. In this paper, we present GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value. This adjustment is done by introducing a scalar multiplier variable in front of each parameter block and then optimizing these variables using a simple numerical scheme. GradInit accelerates convergence and improves the test performance of many convolutional architectures, both with and without skip connections, and even without normalization layers. It also enables training the original Post-LN Transformer for machine translation without learning rate warmup under a wide range of learning rates and momentum coefficients. Code is available at https://github.com/zhuchen03/gradinit.
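
A minimal, hedged sketch of the heuristic described above: learn one scalar multiplier per parameter block so that a single simulated SGD step produces the smallest possible loss. It assumes PyTorch 2.x (torch.func.functional_call) and omits the gradient-norm constraint and the Adam variant handled in the official code; all names here are illustrative.

```python
import torch
from torch.func import functional_call

def gradinit(model, loss_fn, data_iter, eta=0.1, lr=1e-3, steps=100):
    named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    base = {n: p.detach() for n, p in named}
    # One learnable scalar per parameter block, initialized to 1.
    scales = {n: torch.ones(1, device=p.device, requires_grad=True) for n, p in named}
    opt = torch.optim.Adam(scales.values(), lr=lr)

    for _ in range(steps):
        x, y = next(data_iter)
        scaled = {n: scales[n] * base[n] for n, _ in named}

        # Loss at the rescaled initialization and its gradient w.r.t. the weights.
        loss0 = loss_fn(functional_call(model, scaled, (x,)), y)
        grads = torch.autograd.grad(loss0, list(scaled.values()), create_graph=True)

        # Simulate one SGD step with step size eta and measure the resulting loss.
        stepped = {n: w - eta * g for (n, w), g in zip(scaled.items(), grads)}
        x2, y2 = next(data_iter)
        loss1 = loss_fn(functional_call(model, stepped, (x2,)), y2)

        # Gradient flows only into the scale variables.
        opt.zero_grad()
        loss1.backward()
        opt.step()

    # Bake the learned scales back into the model's parameters.
    with torch.no_grad():
        for n, p in named:
            p.mul_(scales[n].item())
```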

Data Augmentation for Meta-Learning

Oct 14, 2020
Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, Tom Goldstein

Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, sophisticated data augmentation schemes are used to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample not only images, but classes as well. We investigate how data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
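
To make the class-level idea concrete, here is a hypothetical sketch in which new classes are synthesized by rotating every image of an existing class, and N-way K-shot episodes are then sampled from the enlarged class pool. The helper names, the choice of rotations, and the episode format are illustrative assumptions, not the paper's exact pipeline.

```python
import random
import torchvision.transforms.functional as TF

def augment_class_pool(class_to_images, rotations=(90, 180, 270)):
    """Expand a {class_name: [images]} pool by adding rotated copies as new classes."""
    augmented = dict(class_to_images)
    for name, images in class_to_images.items():
        for angle in rotations:
            augmented[f"{name}_rot{angle}"] = [TF.rotate(img, angle) for img in images]
    return augmented

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=15):
    """Sample an N-way K-shot episode (support and query sets) from the class pool."""
    chosen = random.sample(list(class_to_images), n_way)
    support, query = [], []
    for label, name in enumerate(chosen):
        images = random.sample(class_to_images[name], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query
```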

WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic

Jul 26, 2020
Renkun Ni, Hong-min Chu, Oscar Castañeda, Ping-yeh Chiang, Christoph Studer, Tom Goldstein

Low-resolution neural networks represent both weights and activations with few bits, drastically reducing the multiplication complexity. Nonetheless, these products are accumulated using high-resolution (typically 32-bit) additions, an operation that dominates the arithmetic complexity of inference when using extreme quantization (e.g., binary weights). To further optimize inference, we propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts. We achieve resilience to low-resolution accumulation by inserting a cyclic activation layer, as well as an overflow penalty regularizer. We demonstrate the efficacy of our approach on both software and hardware platforms.
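
A short, hedged sketch of the two ingredients named above, assuming PyTorch: a cyclic (wrap-around) activation that makes signed-integer overflow in a low-resolution accumulator harmless, and a regularizer that penalizes pre-activations which would overflow that range. The exact functional form of the paper's cyclic activation differs; constants and names here are illustrative.

```python
import torch
import torch.nn as nn

class CyclicActivation(nn.Module):
    """Wrap pre-activations into the accumulator range so overflow does not change the output."""
    def __init__(self, acc_bits=8):
        super().__init__()
        self.period = 2 ** acc_bits        # e.g., 256 representable values for an 8-bit accumulator

    def forward(self, z):
        half = self.period / 2
        # Map into [-half, half), mimicking signed two's-complement wraparound.
        return torch.remainder(z + half, self.period) - half

def overflow_penalty(z, acc_bits=8):
    """Penalize pre-activations whose magnitude exceeds the accumulator range."""
    half = 2 ** (acc_bits - 1)
    return torch.relu(z.abs() - half).mean()
```

In training, the penalty would simply be added to the task loss with a small weight so that pre-activations are pushed back into the representable range.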

Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks

Mar 21, 2020
Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, Tom Goldstein

Meta-learning algorithms produce feature extractors which achieve state-of-the-art performance on few-shot classification. While the literature is rich with meta-learning methods, little is known about why the resulting feature extractors perform so well. We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. In doing so, we develop several hypotheses for why meta-learned models perform better. In addition to visualizations, we design several regularizers inspired by our hypotheses which improve performance on few-shot classification.

Certified Defenses for Adversarial Patches

Mar 14, 2020
Ping-Yeh Chiang, Renkun Ni, Ahmed Abdelkader, Chen Zhu, Christoph Studer, Tom Goldstein

Adversarial patch attacks are among the most practical threat models against real-world computer vision systems. This paper studies certified and empirical defenses against patch attacks. We begin with a set of experiments showing that most existing defenses, which work by pre-processing input images to mitigate adversarial patches, are easily broken by simple white-box adversaries. Motivated by this finding, we propose the first certified defense against patch attacks, and propose faster methods for its training. Furthermore, we experiment with different patch shapes for testing, obtaining surprisingly good robustness transfer across shapes, and present preliminary results on certified defense against sparse attacks. Our complete implementation can be found at https://github.com/Ping-C/certifiedpatchdefense.

* to be published in International Conference on Learning Representations, ICLR 2020 

Improving the Tightness of Convex Relaxation Bounds for Training Certifiably Robust Classifiers

Feb 22, 2020
Chen Zhu, Renkun Ni, Ping-yeh Chiang, Hengduo Li, Furong Huang, Tom Goldstein

Convex relaxations are effective for training and certifying neural networks against norm-bounded adversarial attacks, but they leave a large gap between certifiable and empirical robustness. In principle, convex relaxation can provide tight bounds if the solution to the relaxed problem is feasible for the original non-convex problem. We propose two regularizers that can be used to train neural networks that yield tighter convex relaxation bounds for robustness. In all of our experiments, the proposed regularizers result in higher certified accuracy than non-regularized baselines.

WITCHcraft: Efficient PGD attacks with random step size

Nov 18, 2019
Ping-Yeh Chiang, Jonas Geiping, Micah Goldblum, Tom Goldstein, Renkun Ni, Steven Reich, Ali Shafahi

State-of-the-art adversarial attacks on neural networks use expensive iterative methods and numerous random restarts from different initial points. Iterative FGSM-based methods without restarts trade off performance for computational efficiency because they do not adequately explore the image space and are highly sensitive to the choice of step size. We propose a variant of Projected Gradient Descent (PGD) that uses a random step size to improve performance without resorting to expensive random restarts. Our method, Wide Iterative Stochastic crafting (WITCHcraft), achieves results superior to the classical PGD attack on the CIFAR-10 and MNIST data sets but without additional computational cost. This simple modification of PGD makes crafting attacks more economical, which is important in situations like adversarial training where attacks need to be crafted in real time.

* Authors contributed equally and are listed in alphabetical order 
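
A minimal sketch of the attack described above, assuming a PyTorch image classifier with inputs in [0, 1]: it is standard L-infinity PGD except that the step size is drawn at random at every iteration. The sampling distribution and the constants are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def witchcraft_attack(model, x, y, eps=8 / 255, steps=20, max_step=4 / 255):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Random step size per image and per iteration: the key change from vanilla PGD.
        alpha = torch.empty(x.size(0), 1, 1, 1, device=x.device).uniform_(0, max_step)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)                    # project into the L-infinity ball
            delta.copy_((x + delta).clamp(0, 1) - x)   # keep the adversarial image valid
    return (x + delta).detach()
```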

Learning Accurate Low-Bit Deep Neural Networks with Stochastic Quantization

Aug 03, 2017
Yinpeng Dong, Renkun Ni, Jianguo Li, Yurong Chen, Jun Zhu, Hang Su

Low-bit deep neural networks (DNNs) have become critical for embedded applications due to their low storage requirements and computational efficiency. However, they suffer from a non-negligible accuracy drop. This paper proposes the stochastic quantization (SQ) algorithm for learning accurate low-bit DNNs. The motivation comes from the following observation: existing training algorithms approximate the real-valued elements/filters with low-bit representations all at once in each iteration. The quantization errors may be small for some elements/filters but substantial for others, which leads to inappropriate gradient directions during training and thus a notable accuracy drop. Instead, SQ quantizes a portion of the elements/filters to low-bit with a stochastic probability inversely proportional to the quantization error, while keeping the remaining portion unchanged at full precision. The quantized and full-precision portions are updated with their corresponding gradients separately in each iteration. The SQ ratio is gradually increased until the whole network is quantized. This procedure greatly compensates for the quantization error and thus yields better accuracy for low-bit DNNs. Experiments show that SQ consistently and significantly improves accuracy for different low-bit DNNs on various datasets and network structures.

* BMVC 2017 Oral 
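
As a hedged illustration of the selection rule described in the abstract, the sketch below quantizes a fraction r of the filters in a weight tensor, choosing filters with probability inversely proportional to their relative quantization error and leaving the rest at full precision. The sign-based quantizer and all names are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def binarize(w):
    """Illustrative sign-based quantizer with one scale per output filter."""
    scale = w.abs().mean(dim=tuple(range(1, w.dim())), keepdim=True)
    return scale * w.sign()

def stochastic_quantize(weight, r=0.5):
    """Quantize roughly a fraction r of the filters, preferring those with low quantization error."""
    q = binarize(weight)
    # Relative quantization error per filter (dim 0 indexes output filters).
    err = (weight - q).flatten(1).norm(dim=1) / (weight.flatten(1).norm(dim=1) + 1e-12)
    # Selection probability is inversely proportional to the error.
    prob = 1.0 / (err + 1e-12)
    prob = prob / prob.sum()
    n_quant = max(1, int(r * weight.size(0)))
    idx = torch.multinomial(prob, n_quant, replacement=False)
    mask = torch.zeros(weight.size(0), dtype=torch.bool, device=weight.device)
    mask[idx] = True
    shape = (-1,) + (1,) * (weight.dim() - 1)
    mixed = torch.where(mask.view(shape), q, weight)
    return mixed, mask   # the mask tells the training loop which filters carry quantized weights
```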