Jose M. Alvarez

HALP: Hardware-Aware Latency Pruning

Oct 20, 2021

Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez

Abstract: Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP), which formulates structural pruning as a global resource allocation optimization problem that aims to maximize accuracy while constraining latency under a predefined budget. For filter importance ranking, HALP leverages a latency lookup table to track latency reduction potential and a global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward maximization problem under a given target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on ImageNet and VOC datasets. In particular, for ResNet-50/-101 pruning on ImageNet, HALP improves network throughput by $1.60\times$/$1.90\times$ with $+0.3\%$/$-0.2\%$ top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by $1.94\times$ with only a $0.56$ mAP drop. HALP consistently outperforms prior art, sometimes by large margins.
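
The budgeted selection described in the abstract can be illustrated with a plain 0/1 knapsack: each candidate filter group has an importance score (the reward) and a latency cost, and the goal is to keep the most important set whose total latency fits the budget. This is only a minimal sketch and not HALP's augmented knapsack solver; the function name `knapsack_select`, the use of integer latency units, and the toy numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def knapsack_select(
    importance: List[float], latency: List[int], budget: int
) -> Tuple[float, List[int]]:
    """Pick items maximizing total importance with total latency <= budget.

    Hypothetical helper: a textbook 0/1 knapsack solved by dynamic programming.
    Latency costs are assumed to be non-negative integers (e.g. lookup-table
    entries rounded to a fixed unit).
    """
    n = len(importance)
    # best[i][b] = best total importance using the first i items within budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, gain = latency[i - 1], importance[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune item i-1
            if cost <= b:
                candidate = best[i - 1][b - cost] + gain  # option 2: keep it
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which items were kept.
    kept: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= latency[i - 1]
    kept.reverse()
    return best[n][budget], kept


if __name__ == "__main__":
    # Toy example: four filter groups, latency budget of 5 units.
    score, kept = knapsack_select([0.9, 0.5, 0.4, 0.1], [4, 3, 2, 1], 5)
    print(score, kept)
```

The table has `(n + 1) * (budget + 1)` entries, so the sketch is practical only when latency is discretized coarsely enough to keep the budget small.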

Panoptic SegFormer

Sep 11, 2021

Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Tong Lu, Ping Luo

Abstract: We present Panoptic SegFormer, a general framework for end-to-end panoptic segmentation with Transformers. The proposed method extends Deformable DETR with a unified mask prediction workflow for both things and stuff, making the panoptic segmentation pipeline concise and effective. With a ResNet-50 backbone, our method achieves 50.0% PQ on the COCO test-dev split, surpassing previous state-of-the-art methods by significant margins without bells and whistles. Using a more powerful PVTv2-B5 backbone, Panoptic SegFormer achieves a new record of 54.1% PQ and 54.4% PQ on the COCO val and test-dev splits, respectively, with single-scale input.

* Technical Report

Deep Neural Networks are Surprisingly Reversible: A Baseline for Zero-Shot Inversion

Jul 13, 2021

Xin Dong, Hongxu Yin, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

Abstract: Understanding the behavior and vulnerability of pre-trained deep neural networks (DNNs) can help to improve them. Analysis can be performed by reversing the network's flow to generate inputs from internal representations. Most existing work relies on priors or data-intensive optimization to invert a model, yet struggles to scale to deep architectures and complex datasets. This paper presents a zero-shot direct model inversion framework that recovers the input to the trained model given only the internal representation. The crux of our method is to invert the DNN in a divide-and-conquer manner while re-syncing the inverted layers via cycle-consistency guidance with the help of synthesized data. As a result, we obtain a single feed-forward model capable of inversion with a single forward pass without seeing any real data of the original task. With the proposed approach, we scale zero-shot direct inversion to deep architectures and complex datasets. We empirically show that modern classification models on ImageNet can, surprisingly, be inverted, allowing an approximate recovery of the original 224x224px images from a representation after more than 20 layers. Moreover, inversion of generators in GANs unveils the latent code of a given synthesized face image at 128x128px, which can even, in turn, improve defective synthesized images from GANs.

* A new inversion method to reverse neural networks and get input from intermediate feature maps. Works without original data for classifiers and GANs

Towards Reducing Labeling Cost in Deep Object Detection

Jun 22, 2021

Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixe, Jose M. Alvarez

Abstract: Deep neural networks have reached very high accuracy on object detection, but their success hinges on large amounts of labeled data. To reduce the dependency on labels, various active-learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards best-performing classes and can lead to acquired datasets that are not good representatives of the data in the testing set. In this work, we propose a unified framework for active learning that considers both the uncertainty and the robustness of the detector, ensuring that the network performs accurately in all classes. Furthermore, our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift while further boosting the performance of the model. Experiments show that our method comprehensively outperforms a wide range of active-learning methods on PASCAL VOC07+12 and MS-COCO, with up to a 7.7% relative improvement, or up to an 82% reduction in labeling cost.
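
The two uses of the unlabeled pool mentioned in the abstract (sending uncertain samples to annotators and pseudo-labeling very confident ones) can be sketched as below. This is a generic illustration under stated assumptions, not the paper's acquisition function: it scores uncertainty with plain predictive entropy and ignores the robustness term, and the names `split_pool`, `budget`, and `confidence` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_pool(
    predictions: Dict[str, List[float]], budget: int, confidence: float = 0.95
) -> Tuple[List[str], Dict[str, int]]:
    """Split an unlabeled pool into samples to annotate and pseudo-labels.

    predictions maps a sample id to its predicted class probabilities.
    The `budget` most uncertain samples are sent for human labeling; among
    the rest, samples whose top probability reaches `confidence` receive the
    predicted class index as a pseudo-label. Everything else stays unlabeled.
    """
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    to_label = ranked[:budget]

    pseudo_labels: Dict[str, int] = {}
    for sample_id in ranked[budget:]:
        probs = predictions[sample_id]
        top_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[top_class] >= confidence:
            pseudo_labels[sample_id] = top_class
    return to_label, pseudo_labels


if __name__ == "__main__":
    pool = {
        "a": [0.5, 0.5],    # most uncertain
        "b": [0.99, 0.01],  # confident, class 0
        "c": [0.7, 0.3],    # neither uncertain enough nor confident enough
        "d": [0.02, 0.98],  # confident, class 1
    }
    print(split_pool(pool, budget=1))
```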

* Includes supplementary material

Distilling Image Classifiers in Object Detectors

Jun 09, 2021

Shuxuan Guo, Jose M. Alvarez, Mathieu Salzmann

Abstract: Knowledge distillation constitutes a simple yet effective way to improve the performance of a compact student network by exploiting the knowledge of a more powerful teacher. Nevertheless, the knowledge distillation literature remains limited to the scenario where the student and the teacher tackle the same task. Here, we investigate the problem of transferring knowledge not only across architectures but also across tasks. To this end, we study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework. In particular, we propose strategies to exploit the classification teacher to improve both the detector's recognition accuracy and localization performance. Our experiments on several detectors with different backbones demonstrate the effectiveness of our approach, allowing us to outperform the state-of-the-art detector-to-detector distillation methods.
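
For readers unfamiliar with the same-task setting that the abstract contrasts itself with, the widely used soft-target distillation loss can be sketched as follows. This is the standard temperature-softened KL formulation, not the classifier-to-detector transfer proposed in the paper; the function names and the default temperature are assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float], teacher_logits: List[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0]))
```

The loss is zero when the student reproduces the teacher's logits and grows as the softened distributions diverge.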

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Jun 05, 2021

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo

Abstract: We present SegFormer, a simple, efficient, yet powerful semantic segmentation framework that unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

* Tech Report

See through Gradients: Image Batch Recovery via GradInversion

Apr 15, 2021

Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

Abstract: Training deep neural networks requires gradient estimation from data batches to update parameters. Gradients per parameter are averaged over a set of data, and this has been presumed to be safe for privacy-preserving training in joint, collaborative, and federated learning applications. Prior work only showed the possibility of recovering input data given gradients under very restrictive conditions: a single input point, a network with no non-linearities, or a small 32x32 px input batch. Therefore, averaging gradients over larger batches was thought to be safe. In this work, we introduce GradInversion, with which input images from a larger batch (8 to 48 images) can also be recovered for large networks such as ResNets (50 layers), on complex datasets such as ImageNet (1000 classes, 224x224 px). We formulate an optimization task that converts random noise into natural images, matching gradients while regularizing image fidelity. We also propose an algorithm for target class label recovery given gradients. We further propose a group consistency regularization framework, where multiple agents starting from different random seeds work together to find an enhanced reconstruction of the original data batch. We show that gradients encode a surprisingly large amount of information, such that all the individual images can be recovered with high fidelity via GradInversion, even for complex datasets, deep networks, and large batch sizes.

* CVPR 2021 accepted paper

Image-Level or Object-Level? A Tale of Two Resampling Strategies for Long-Tailed Detection

Apr 12, 2021

Nadine Chang, Zhiding Yu, Yu-Xiong Wang, Anima Anandkumar, Sanja Fidler, Jose M. Alvarez

Abstract: Training on datasets with long-tailed distributions has been challenging for major recognition tasks such as classification and detection. To deal with this challenge, image resampling is typically introduced as a simple but effective approach. However, we observe that long-tailed detection differs from classification since multiple classes may be present in one image. As a result, image resampling alone is not enough to yield a sufficiently balanced distribution at the object level. We address object-level resampling by introducing an object-centric memory replay strategy based on dynamic, episodic memory banks. Our proposed strategy has two benefits: 1) convenient object-level resampling without significant extra computation, and 2) implicit feature-level augmentation from model updates. We show that image-level and object-level resamplings are both important, and thus unify them with a joint resampling strategy (RIO). Our method outperforms state-of-the-art long-tailed detection and segmentation methods on LVIS v0.5 across various backbones.
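
The image-level resampling that the abstract takes as its starting point can be sketched with repeat factor sampling, one common scheme for long-tailed detection data. Whether the paper uses exactly this scheme is an assumption here, and the helper name `repeat_factors` and the toy data are hypothetical. Each category c gets a factor max(1, sqrt(t / f_c)), where f_c is the fraction of images containing c and t is a threshold, and an image is repeated according to its rarest category.

```python
import math
from collections import Counter
from typing import List


def repeat_factors(image_labels: List[List[str]], threshold: float) -> List[float]:
    """Per-image repeat factors for image-level resampling.

    image_labels[i] lists the category of every object in image i.
    An image with no objects gets the neutral factor 1.0.
    """
    num_images = len(image_labels)
    # Count, for each category, how many images contain it at least once.
    images_with = Counter(c for labels in image_labels for c in set(labels))
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (count / num_images)))
        for c, count in images_with.items()
    }
    # An image inherits the largest factor among the categories it contains.
    return [
        max((category_factor[c] for c in set(labels)), default=1.0)
        for labels in image_labels
    ]


if __name__ == "__main__":
    data = [["car"], ["car"], ["car"], ["car", "car", "zebra"]]
    print(repeat_factors(data, threshold=0.5))
```

In the toy data the last image is oversampled because it contains the rare `zebra`, but repeating it also repeats its two `car` objects, which illustrates why image-level resampling alone does not balance the object-level distribution.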

Self-supervised Learning of Depth Inference for Multi-view Stereo

Apr 07, 2021

Jiayu Yang, Jose M. Alvarez, Miaomiao Liu

Abstract: Recent supervised multi-view depth estimation networks have achieved promising results. Similar to all supervised approaches, these networks require ground-truth data during training. However, collecting a large amount of multi-view depth data is very challenging. Here, we propose a self-supervised learning framework for multi-view stereo that exploits pseudo labels from the input data. We start by learning to estimate depth maps as initial pseudo labels under an unsupervised learning framework relying on image reconstruction loss as supervision. We then refine the initial pseudo labels using a carefully designed pipeline leveraging depth information inferred from higher resolution images and neighboring views. We use these high-quality pseudo labels as the supervision signal to train the network and iteratively improve its performance by self-training. Extensive experiments on the DTU dataset show that our proposed self-supervised learning framework outperforms existing unsupervised multi-view stereo networks by a large margin and performs on par with the supervised counterpart. Code is available at https://github.com/JiayuYANG/Self-supervised-CVP-MVSNet.

* CVPR 2021

Contrastive Syn-to-Real Generalization

Apr 06, 2021

Wuyang Chen, Zhiding Yu, Shalini De Mello, Sifei Liu, Jose M. Alvarez, Zhangyang Wang, Anima Anandkumar

Abstract: Training on synthetic data can be beneficial for label or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages the pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the proposed CSG framework with attentional pooling (A-pool) to let the model focus on semantically important regions and further improve its generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, exhibiting state-of-the-art performance on zero-shot domain generalization.

* Accepted in ICLR 2021
