Cho-Jui Hsieh

Efficient Contextual Representation Learning Without Softmax Layer

Feb 28, 2019

Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang

Abstract: Contextual representation models have achieved great success in improving various downstream tasks. However, these language-model-based encoders are difficult to train due to their large parameter sizes and high computational complexity. By carefully examining the training procedure, we find that the softmax layer (the output layer) causes significant inefficiency due to the large vocabulary size. Therefore, we redesign the learning objective and propose an efficient framework for training contextual representation models. Specifically, the proposed approach bypasses the softmax layer by performing language modeling with dimension reduction, and allows the models to leverage pre-trained word embeddings. Our framework reduces the time spent on the output layer to a negligible level, eliminates almost all the trainable parameters of the softmax layer, and performs language modeling without truncating the vocabulary. When applied to ELMo, our method achieves a 4x speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks.

* Work in progress

Robust Decision Trees Against Adversarial Examples

Feb 27, 2019

Hongge Chen, Huan Zhang, Duane Boning, Cho-Jui Hsieh

Abstract: Although adversarial examples and model robustness have been extensively studied in the context of linear models and neural networks, research on this issue in tree-based models, and on how to make tree-based models robust against adversarial examples, is still limited. In this paper, we show that tree-based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. At its core, our method aims to optimize the performance under the worst-case perturbation of input features, which leads to a max-min saddle point problem. Incorporating this saddle point objective into the decision tree building procedure is non-trivial due to the discrete nature of trees: a naive approach to finding the best split according to this saddle point objective will take exponential time. To make our approach practical and scalable, we propose efficient tree building algorithms by approximating the inner minimizer in this saddle point problem, and present efficient implementations for classical information-gain-based trees as well as state-of-the-art tree boosting models such as XGBoost. Experimental results on real-world datasets demonstrate that the proposed algorithms can substantially improve the robustness of tree-based models against adversarial examples.
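
To make the "worst-case perturbation" objective concrete, the sketch below scores a single 1-D decision stump by counting only the points that stay correctly classified for every perturbation of size at most eps. It is an illustration of the inner worst-case evaluation only, not the paper's tree building algorithm; the function name, the 1-D setting, and the split rule (x < threshold goes left) are assumptions made for this example.

```python
def worst_case_accuracy(xs, ys, threshold, left_label, right_label, eps):
    """Fraction of 1-D points a stump classifies correctly for EVERY
    perturbation delta with |delta| <= eps (hypothetical helper)."""
    robust = 0
    for x, y in zip(xs, ys):
        reachable = set()
        # Some perturbed value falls left of the split.
        if x - eps < threshold:
            reachable.add(left_label)
        # Some perturbed value falls on or right of the split.
        if x + eps >= threshold:
            reachable.add(right_label)
        # Robustly correct only if every reachable prediction equals y.
        if reachable == {y}:
            robust += 1
    return robust / len(xs)


xs = [0.0, 0.45, 0.55, 1.0]
ys = [0, 0, 1, 1]
clean = worst_case_accuracy(xs, ys, 0.5, 0, 1, eps=0.0)
robust = worst_case_accuracy(xs, ys, 0.5, 0, 1, eps=0.1)
```

In this toy data the two points near the threshold can be pushed across the split, so the worst-case score is lower than the clean score even though the stump separates the unperturbed data perfectly.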

A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks

Feb 26, 2019

Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, Pengchuan Zhang

Abstract: Verification of neural networks enables us to gauge their robustness against adversarial attacks. Verification algorithms fall into two categories: exact verifiers that run in exponential time and relaxed verifiers that are efficient but incomplete. In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification. We further prove strong duality between the primal and dual problems under very mild conditions. Next, we perform large-scale experiments, amounting to more than 22 CPU-years, to obtain the exact solution to the convex-relaxed problem that is optimal within our framework for ReLU networks. We find that the exact solution does not significantly improve upon the gap between PGD and existing relaxed verifiers for various networks trained normally or robustly on the MNIST and CIFAR datasets. Our results suggest there is an inherent barrier to tight verification for the large class of methods captured by our framework. We discuss possible causes of this barrier and potential future directions for bypassing it.

Large-Batch Training for LSTM and Beyond

Jan 24, 2019

Yang You, Jonathan Hseu, Chris Ying, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

Abstract: Large-batch training approaches have enabled researchers to utilize large-scale distributed processing and greatly accelerate deep neural network (DNN) training. For example, by scaling the batch size from 256 to 32K, researchers have been able to reduce the training time of ResNet50 on ImageNet from 29 hours to 2.2 minutes (Ying et al., 2018). In this paper, we propose a new approach called linear-epoch gradual-warmup (LEGW) for better large-batch training. With LEGW, we are able to conduct large-batch training for both CNNs and RNNs with the Sqrt Scaling scheme. LEGW makes the Sqrt Scaling scheme useful in practice, and as a result we achieve much better results than with the Linear Scaling learning rate scheme. For LSTM applications, we are able to scale the batch size by a factor of 64 without losing accuracy and without tuning the hyper-parameters. For CNN applications, LEGW is able to achieve the same accuracy even as we scale the batch size to 32K. LEGW works better than previous large-batch auto-tuning techniques. LEGW achieves a 5.3X average speedup over the baselines for four LSTM-based applications on the same hardware. We also provide some theoretical explanations for LEGW.

* Preprint. Work in progress. We may update this draft soon
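
The combination of Sqrt Scaling and linear-epoch warmup named in the abstract can be sketched as a small schedule helper. The sketch assumes the usual reading of the two terms: when the batch size grows by a factor k, the peak learning rate grows by sqrt(k) and the number of warmup epochs grows by k. The function names, the linear ramp from zero, and holding the rate constant after warmup are simplifications chosen for this example rather than details taken from the abstract.

```python
import math


def legw_config(base_lr, base_batch, base_warmup_epochs, batch):
    """Scale a tuned baseline to a new batch size (hypothetical helper).

    Assumption: batch size x k  ->  peak LR x sqrt(k), warmup epochs x k.
    """
    k = batch / base_batch
    return base_lr * math.sqrt(k), base_warmup_epochs * k


def lr_at(epoch, peak_lr, warmup_epochs):
    """Linear ramp from 0 to peak_lr during warmup, constant afterwards."""
    if warmup_epochs > 0 and epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    return peak_lr


# Baseline tuned at batch 256; scale to batch 1024 (k = 4).
peak, warmup = legw_config(base_lr=0.1, base_batch=256,
                           base_warmup_epochs=0.5, batch=1024)
halfway = lr_at(1.0, peak, warmup)
```

Only the baseline needs tuning under this scheme; the larger-batch settings follow from k, which matches the abstract's claim of scaling without re-tuning hyper-parameters.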

The Limitations of Adversarial Training and the Blind-Spot Attack

Jan 15, 2019

Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S. Dhillon, Cho-Jui Hsieh

Abstract: The adversarial training procedure proposed by Madry et al. (2018) is one of the most effective methods to defend against adversarial examples in deep neural networks (DNNs). In our paper, we shed some light on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on the test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network. Test examples that are relatively far away from this manifold are more likely to be vulnerable to adversarial attacks. Consequently, an adversarial-training-based defense is susceptible to a new class of attacks, the "blind-spot attack", where the input images reside in "blind-spots" (low-density regions) of the empirical distribution of training data but are still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high-dimensional and complex data manifolds (CIFAR, ImageNet, etc.), the existence of blind-spots in adversarial training makes defense on any valid test example difficult due to the curse of dimensionality and the scarcity of training data. Additionally, we find that blind-spots also exist in provable defenses, including (Wong & Kolter, 2018) and (Sinha et al., 2018), because these trainable robustness certificates can only be practically optimized on a limited set of training data.

* Accepted by International Conference on Learning Representations (ICLR) 2019. Huan Zhang and Hongge Chen contributed equally
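
The "scaling and shifting" transformation mentioned for MNIST can be written as a one-line affine map on pixel intensities. The sketch below assumes pixels are floats in [0, 1] and clips the result back to that range; the function name and the particular alpha and beta values are illustrative and not taken from the paper.

```python
def scale_and_shift(pixels, alpha, beta):
    """Apply p -> alpha * p + beta to each pixel, clipped to [0, 1]."""
    return [min(1.0, max(0.0, alpha * p + beta)) for p in pixels]


# A flattened toy "image"; real MNIST inputs would have 784 values.
image = [0.0, 0.5, 1.0]
shifted = scale_and_shift(image, alpha=0.8, beta=0.1)
```

The transformed image keeps the same visual content, which is why such inputs can remain on the ground-truth data manifold while falling in a low-density region of the training distribution.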

Optimal Transport Classifier: Defending Against Adversarial Attacks by Regularized Deep Embedding

Dec 09, 2018

Yao Li, Martin Renqiang Min, Wenchao Yu, Cho-Jui Hsieh, Thomas C. M. Lee, Erik Kruus

Abstract: Recent studies have demonstrated the vulnerability of deep convolutional neural networks to adversarial examples. Inspired by the observations that the intrinsic dimension of image data is much smaller than its pixel space dimension and that the vulnerability of neural networks grows with the input dimension, we propose to embed high-dimensional input images into a low-dimensional space to perform classification. However, arbitrarily projecting the input images to a low-dimensional space without regularization will not improve the robustness of deep neural networks. Leveraging optimal transport theory, we propose a new framework, Optimal Transport Classifier (OT-Classifier), and derive an objective that minimizes the discrepancy between the distribution of the true label and the distribution of the OT-Classifier output. Experimental results on several benchmark datasets show that our proposed framework achieves state-of-the-art performance against strong adversarial attack methods.

* 9 pages

Block-wise Partitioning for Extreme Multi-label Classification

Nov 04, 2018

Yuefeng Liang, Cho-Jui Hsieh, Thomas C. M. Lee

Abstract: Extreme multi-label classification aims to learn a classifier that annotates an instance with a relevant subset of labels from an extremely large label set. Many existing solutions embed the label matrix to a low-dimensional linear subspace, or examine the relevance of a test instance to every label via a linear scan. In practice, however, those approaches can be computationally exorbitant. To alleviate this drawback, we propose a Block-wise Partitioning (BP) pretreatment that divides all instances into disjoint clusters, to each of which the most frequently tagged label subset is attached. One multi-label classifier is trained on one pair of instance and label clusters, and the label set of a test instance is predicted by first delivering it to the most appropriate instance cluster. Experiments on benchmark multi-label data sets reveal that BP pretreatment significantly reduces prediction time, and retains almost the same level of prediction accuracy.

Efficient Neural Network Robustness Certification with General Activation Functions

Nov 02, 2018

Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, Luca Daniel

Abstract: Finding the minimum distortion of adversarial examples, and thus certifying robustness in neural network classifiers for given data points, is known to be a challenging problem. Nevertheless, it has recently been shown to be possible to give a non-trivial certified lower bound on the minimum adversarial distortion, and some recent progress has been made in this direction by exploiting the piece-wise linear nature of ReLU activations. However, a generic robustness certification for general activation functions still remains largely unexplored. To address this issue, in this paper we introduce CROWN, a general framework to certify the robustness of neural networks with general activation functions for given input data points. The novelty in our algorithm consists of bounding a given activation function with linear and quadratic functions, hence allowing it to tackle general activation functions including but not limited to four popular choices: ReLU, tanh, sigmoid and arctan. In addition, we facilitate the search for a tighter certified lower bound by adaptively selecting appropriate surrogates for each neuron activation. Experimental results show that CROWN on ReLU networks can notably improve the certified lower bounds compared to the current state-of-the-art algorithm Fast-Lin, while having comparable computational efficiency. Furthermore, CROWN also demonstrates its effectiveness and flexibility on networks with general activation functions, including tanh, sigmoid and arctan.

* Accepted by NIPS 2018. Huan Zhang and Tsui-Wei Weng contributed equally
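
The idea of bounding an activation with linear functions can be illustrated for a single ReLU neuron whose pre-activation is known to lie in [l, u]. The sketch below covers the linear case only (the abstract also mentions quadratic bounds, which are not shown). The rule for picking the lower line, slope 1 when u >= |l| and slope 0 otherwise, is one common adaptive heuristic and is stated here as an assumption; the function name is hypothetical.

```python
def relu_linear_bounds(l, u):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that
    a_lo*x + b_lo <= max(x, 0) <= a_up*x + b_up for all x in [l, u]."""
    if l > u:
        raise ValueError("need l <= u")
    if l >= 0:
        # Neuron is always active: ReLU is the identity on [l, u].
        return (1.0, 0.0), (1.0, 0.0)
    if u <= 0:
        # Neuron is always inactive: ReLU is zero on [l, u].
        return (0.0, 0.0), (0.0, 0.0)
    # Unstable neuron (l < 0 < u): the upper line is the chord
    # through (l, 0) and (u, u).
    slope = u / (u - l)
    upper = (slope, -slope * l)
    # Both y = x and y = 0 are valid lower lines; choose adaptively.
    lower = (1.0, 0.0) if u >= -l else (0.0, 0.0)
    return lower, upper


lower, upper = relu_linear_bounds(-1.0, 3.0)
```

Once every neuron has such a pair of lines, the bounds can be propagated layer by layer to bound the network output over the whole input region, which is what makes a certified lower bound on the distortion computable.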

Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks

Oct 29, 2018

Patrick H. Chen, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh

Abstract: Neural language models have been widely used in various NLP tasks, including machine translation, next-word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is computing the top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm that exploits the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging, as traditional clustering methods are discrete and non-differentiable and thus cannot be used with back-propagation in the training process. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit the data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-k words in various tasks such as beam search in machine translation or next-word prediction. For example, for a machine translation task on a German-to-English dataset with a vocabulary of around 25K, we can achieve a 20.4x speedup with 98.9% precision@1 and 99.3% precision@5 relative to the original softmax layer prediction, while the state-of-the-art method [MSRprediction] only achieves a 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 for the same task.
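
The two-stage inference described above (screen first, then run an exact softmax on the surviving candidates) can be sketched as follows. This is a minimal illustration under stated assumptions: the screening model is reduced to a linear scorer that picks one cluster, each cluster has a fixed candidate word list, and all names (cluster_weights, candidates, word_vectors) are hypothetical. Training the screener with the Gumbel softmax is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def screened_top_k(context, cluster_weights, candidates, word_vectors, k):
    """Pick one cluster for the context, then run an exact softmax
    over that cluster's candidate words only."""
    # Stage 1: light-weight screening, one dot product per cluster.
    cluster = max(range(len(cluster_weights)),
                  key=lambda c: dot(cluster_weights[c], context))
    words = candidates[cluster]
    # Stage 2: exact, numerically stable softmax restricted to the subset.
    logits = [dot(word_vectors[w], context) for w in words]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    scored = sorted(zip(words, (e / total for e in exps)),
                    key=lambda pair: -pair[1])
    return scored[:k]


cluster_weights = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0, 1], [2, 3]]             # word ids attached to each cluster
word_vectors = [[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
top = screened_top_k([0.1, 1.0], cluster_weights, candidates, word_vectors, k=2)
```

The cost per query is one dot product per cluster plus one per candidate word, instead of one per vocabulary entry. Note that the returned probabilities are normalized over the candidate subset, not the full vocabulary.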

RecurJac: An Efficient Recursive Algorithm for Bounding Jacobian Matrix of Neural Networks and Its Applications

Oct 28, 2018

Huan Zhang, Pengchuan Zhang, Cho-Jui Hsieh

Abstract: The Jacobian matrix (or the gradient for single-output networks) is directly related to many important properties of neural networks, such as the function landscape, stationary points, (local) Lipschitz constants and robustness to adversarial attacks. In this paper, we propose a recursive algorithm, RecurJac, to compute both upper and lower bounds for each element in the Jacobian matrix of a neural network with respect to the network's input, where the network can contain a wide range of activation functions. As a byproduct, we can efficiently obtain a (local) Lipschitz constant, which plays a crucial role in neural network robustness verification, as well as in the training stability of GANs. Experiments show that the (local) Lipschitz constants produced by our method are of better quality than those of previous approaches, thus providing better robustness verification results. Our algorithm has polynomial time complexity, and its computation time is reasonable even for relatively large networks. Additionally, we use our bounds on the Jacobian matrix to characterize the landscape of the neural network, for example, to determine whether there exist stationary points in a local neighborhood. Source code available at https://github.com/huanzhang12/RecurJac-Jacobian-Bounds.

* Work done during internship at Microsoft Research
