Alert button
Picture for Fartash Faghri

Fartash Faghri

Alert button

Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement

Mar 15, 2023
Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, Oncel Tuzel

Figure 1 for Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Figure 2 for Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Figure 3 for Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Figure 4 for Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement

We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. We propose a Dataset Reinforcement strategy based on data augmentation and knowledge distillation. Our generic strategy is designed based on extensive analysis across CNN- and transformer-based models and performing large-scale study of distillation with state-of-the-art models with various data augmentations. We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks (e.g., segmentation and detection). As an example, the accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the ImageNet validation set is also reduced by 9.9%. Using this backbone with Mask-RCNN for object detection on MS-COCO, the mean average precision improves by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers. For MobileNetV3 and Swin-Tiny we observe significant improvements on ImageNet-R/A/C of up to 10% improved robustness. Models pretrained on ImageNet+ and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+, reach up to 3.4% improved accuracy.

Viaarxiv icon

FastFill: Efficient Compatible Model Update

Mar 08, 2023
Florian Jaeckle, Fartash Faghri, Ali Farhadi, Oncel Tuzel, Hadi Pouransari

Figure 1 for FastFill: Efficient Compatible Model Update
Figure 2 for FastFill: Efficient Compatible Model Update
Figure 3 for FastFill: Efficient Compatible Model Update
Figure 4 for FastFill: Efficient Compatible Model Update

In many retrieval systems the original high dimensional data (e.g., images) is mapped to a lower dimensional feature through a learned embedding model. The task of retrieving the most similar data from a gallery set to a given query data is performed through a similarity comparison on features. When the embedding model is updated, it might produce features that are not comparable/compatible with features already in the gallery computed with the old model. Subsequently, all features in the gallery need to be re-computed using the new embedding model -- a computationally expensive process called backfilling. Recently, compatible representation learning methods have been proposed to avoid backfilling. Despite their relative success, there is an inherent trade-off between the new model performance and its compatibility with the old model. In this work, we introduce FastFill: a compatible model update process using feature alignment and policy based partial backfilling to promptly elevate retrieval performance. We show that previous backfilling strategies suffer from decreased performance and demonstrate the importance of both the training objective and the ordering in online partial backfilling. We propose a new training method for feature alignment between old and new embedding models using uncertainty estimation. Compared to previous works, we obtain significantly improved backfilling results on a variety of datasets: mAP on ImageNet (+4.4\%), Places-365 (+2.7\%), and VGG-Face2 (+1.3\%). Further, we demonstrate that when updating a biased model with FastFill, the minority subgroup accuracy gap promptly vanishes with a small fraction of partial backfilling.

* To appear in The Eleventh International Conference on Learning Representations 
Viaarxiv icon

RangeAugment: Efficient Online Augmentation with Range Learning

Dec 20, 2022
Sachin Mehta, Saeid Naderiparizi, Fartash Faghri, Maxwell Horton, Lailin Chen, Ali Farhadi, Oncel Tuzel, Mohammad Rastegari

Figure 1 for RangeAugment: Efficient Online Augmentation with Range Learning
Figure 2 for RangeAugment: Efficient Online Augmentation with Range Learning
Figure 3 for RangeAugment: Efficient Online Augmentation with Range Learning
Figure 4 for RangeAugment: Efficient Online Augmentation with Range Learning

State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question on the importance of magnitude ranges for each augmentation operation, we introduce RangeAugment that allows us to efficiently learn the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity as a measure to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar parameter for search, image similarity, which we simply optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves competitive performance to state-of-the-art automatic augmentation methods with 4-5 times fewer augmentation operations. Experimental results on semantic segmentation, object detection, foundation models, and knowledge distillation further shows RangeAugment's effectiveness.

* Technical report (22 pages including references and appendix) 
Viaarxiv icon

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

Oct 08, 2022
Elan Rosenfeld, Preetum Nakkiran, Hadi Pouransari, Oncel Tuzel, Fartash Faghri

Figure 1 for APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Figure 2 for APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Figure 3 for APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Figure 4 for APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two orders of magnitude less time and data and training 77% fewer parameters.

Viaarxiv icon

MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks

Jul 16, 2022
Ali Ramezani-Kebrya, Iman Tabrizian, Fartash Faghri, Petar Popovski

Figure 1 for MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks
Figure 2 for MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks
Figure 3 for MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks
Figure 4 for MixTailor: Mixed Gradient Aggregation for Robust Learning Against Tailored Attacks

Implementations of SGD on distributed and multi-GPU systems creates new vulnerabilities, which can be identified and misused by one or more adversarial agents. Recently, it has been shown that well-known Byzantine-resilient gradient aggregation schemes are indeed vulnerable to informed attackers that can tailor the attacks (Fang et al., 2020; Xie et al., 2020b). We introduce MixTailor, a scheme based on randomization of the aggregation strategies that makes it impossible for the attacker to be fully informed. Deterministic schemes can be integrated into MixTailor on the fly without introducing any additional hyperparameters. Randomization decreases the capability of a powerful adversary to tailor its attacks, while the resulting randomized aggregation scheme is still competitive in terms of performance. For both iid and non-iid settings, we establish almost sure convergence guarantees that are both stronger and more general than those available in the literature. Our empirical studies across various datasets, attacks, and settings, validate our hypothesis and show that MixTailor successfully defends when well-known Byzantine-tolerant schemes fail.

Viaarxiv icon

Training Efficiency and Robustness in Deep Learning

Dec 02, 2021
Fartash Faghri

Figure 1 for Training Efficiency and Robustness in Deep Learning
Figure 2 for Training Efficiency and Robustness in Deep Learning
Figure 3 for Training Efficiency and Robustness in Deep Learning
Figure 4 for Training Efficiency and Robustness in Deep Learning

Deep Learning has revolutionized machine learning and artificial intelligence, achieving superhuman performance in several standard benchmarks. It is well-known that deep learning models are inefficient to train; they learn by processing millions of training data multiple times and require powerful computational resources to process large batches of data in parallel at the same time rather than sequentially. Deep learning models also have unexpected failure modes; they can be fooled into misbehaviour, producing unexpectedly incorrect predictions. In this thesis, we study approaches to improve the training efficiency and robustness of deep learning models. In the context of learning visual-semantic embeddings, we find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data. We formalize a simple trick called hard negative mining as a modification to the learning objective function with no computational overhead. Next, we seek improvements to optimization speed in general-purpose optimization methods in deep learning. We show that a redundancy-aware modification to the sampling of training data improves the training speed and develops an efficient method for detecting the diversity of training signal, namely, gradient clustering. Finally, we study adversarial robustness in deep learning and approaches to achieve maximal adversarial robustness without training with additional data. For linear models, we prove guaranteed maximal robustness achieved only by appropriate choice of the optimizer, regularization, or architecture.

* A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy 
Viaarxiv icon

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

May 01, 2021
Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

Figure 1 for NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Figure 2 for NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Figure 3 for NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Figure 4 for NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees, however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.

* This entry is redundant and was created in error. See arXiv:1908.06077 for the latest version 
Viaarxiv icon

Bridging the Gap Between Adversarial Robustness and Optimization Bias

Feb 17, 2021
Fartash Faghri, Cristina Vasconcelos, David J. Fleet, Fabian Pedregosa, Nicolas Le Roux

Figure 1 for Bridging the Gap Between Adversarial Robustness and Optimization Bias
Figure 2 for Bridging the Gap Between Adversarial Robustness and Optimization Bias
Figure 3 for Bridging the Gap Between Adversarial Robustness and Optimization Bias
Figure 4 for Bridging the Gap Between Adversarial Robustness and Optimization Bias

Adversarial robustness is an open challenge in deep learning, most often tackled using adversarial training. Adversarial training is computationally costly, involving alternated optimization with a trade-off between standard generalization and adversarial robustness. We explore training robust models without adversarial training by revisiting a known result linking maximally robust classifiers and minimum norm solutions, and combining it with recent results on the implicit bias of optimizers. First, we show that, under certain conditions, it is possible to achieve both perfect standard accuracy and a certain degree of robustness without a trade-off, simply by training an overparameterized model using the implicit bias of the optimization. In that regime, there is a direct relationship between the type of the optimizer and the attack to which the model is robust. Second, we investigate the role of the architecture in designing robust models. In particular, we characterize the robustness of linear convolutional models, showing that they resist attacks subject to a constraint on the Fourier-$\ell_\infty$ norm. This result explains the property of $\ell_p$-bounded adversarial perturbations that tend to be concentrated in the Fourier domain. This leads us to a novel attack in the Fourier domain that is inspired by the well-known frequency-dependent sensitivity of human perception. We evaluate Fourier-$\ell_\infty$ robustness of recent CIFAR-10 models with robust training and visualize adversarial perturbations.

Viaarxiv icon

Adaptive Gradient Quantization for Data-Parallel SGD

Oct 23, 2020
Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, Ali Ramezani-Kebrya

Figure 1 for Adaptive Gradient Quantization for Data-Parallel SGD
Figure 2 for Adaptive Gradient Quantization for Data-Parallel SGD
Figure 3 for Adaptive Gradient Quantization for Data-Parallel SGD
Figure 4 for Adaptive Gradient Quantization for Data-Parallel SGD

Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.

* Accepted at the conference on Neural Information Processing Systems (NeurIPS 2020) 
Viaarxiv icon