Jose M. Alvarez

Structural Pruning via Latency-Saliency Knapsack

Oct 18, 2022

Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez

Abstract:Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP), which formulates structural pruning as a global resource allocation optimization problem, aiming to maximize accuracy while keeping latency under a predefined budget on the target device. For filter importance ranking, HALP leverages a latency lookup table to track latency reduction potential and a global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward maximization problem under a given target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on ImageNet and VOC datasets, and on different platforms. In particular, for ResNet-50/-101 pruning on ImageNet, HALP improves network throughput by $1.60\times$/$1.90\times$ with $+0.3\%$/$-0.2\%$ top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by $1.94\times$ with only a $0.56$ mAP drop. HALP consistently outperforms prior art, sometimes by large margins. Project page at https://halp-neurips.github.io/.

* Accepted by NeurIPS 2022. arXiv admin note: substantial text overlap with arXiv:2110.10811
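
To make the knapsack framing in the abstract concrete, here is a minimal sketch of a plain 0/1 knapsack that keeps the subset of filter groups with the highest total saliency whose summed latency fits a budget. This is not the paper's augmented knapsack solver: the function name, the integer latency units, and the toy numbers are all assumptions made for illustration.

```python
def knapsack_select(saliency, latency, budget):
    """Plain 0/1 knapsack (illustrative, not HALP's augmented solver).

    saliency: importance score per filter group (hypothetical values).
    latency:  non-negative integer latency cost per group (hypothetical units).
    budget:   non-negative integer latency budget.
    Returns (indices of groups to keep, total saliency kept).
    """
    n = len(saliency)
    # best[c] = highest total saliency reachable with latency budget c
    best = [0.0] * (budget + 1)
    # keep[i][c] = True if group i is taken in the optimum over groups 0..i at budget c
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        cost = latency[i]
        # Iterate budgets downwards so each group is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost] + saliency[i]
            if candidate > best[c]:
                best[c] = candidate
                keep[i][c] = True
    # Walk back through the decisions to recover the kept groups.
    chosen, c = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= latency[i]
    chosen.reverse()
    return chosen, best[budget]


# Toy example: three groups, latency budget of 5 units.
kept, total = knapsack_select([6.0, 10.0, 12.0], [1, 2, 3], 5)
print(kept, total)
```

In this toy setup every group not returned would be pruned. In practice the latency of a layer depends on how many filters remain in it, which is part of what a latency lookup table has to capture and which this sketch ignores.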

Optimizing Data Collection for Machine Learning

Oct 03, 2022

Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law

Abstract:Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.

* Accepted to NeurIPS 2022

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

Jul 13, 2022

Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law

Abstract:Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging, where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.

* Accepted to CVPR 2022
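
The power-law baseline mentioned in the abstract can be sketched as follows: fit error = a * n^(-b) to a few (data set size, validation error) measurements, then invert the fit to estimate the size needed for a target error. This is only the simple baseline, not the generalized function family or the tuned correction factor studied in the paper; the function names and the synthetic measurements are assumptions for illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b for a decaying error curve
    a = math.exp(mean_y - slope * mean_x)  # intercept in log space is log(a)
    return a, -slope


def required_size(a, b, target_error):
    """Invert error = a * n**(-b) to get the data set size n for a target error."""
    return (a / target_error) ** (1.0 / b)


# Synthetic measurements generated from error = 2 * n**-0.5 (hypothetical data).
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
print(round(required_size(a, b, target_error=0.05)))
```

On real learning curves the fit is noisy, and small errors in b translate into large errors in the extrapolated size, which is consistent with the abstract's point that a good curve fit does not immediately give a good estimate of data requirements.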

Object-Level Targeted Selection via Deep Template Matching

Jul 05, 2022

Suraj Kothawade, Donna Roy, Michele Fenzi, Elmar Haussmann, Jose M. Alvarez, Christoph Angerer

Abstract:Retrieving images with objects that are semantically similar to objects of interest (OOI) in a query image has many practical use cases. A few examples include fixing failures like false negatives/positives of a learned model or mitigating class imbalance in a dataset. The targeted selection task requires finding the relevant data from a large-scale pool of unlabeled data. Manual mining at this scale is infeasible. Further, the OOI are often small and occupy less than 1% of image area, are occluded, and co-exist with many semantically different objects in cluttered scenes. Existing semantic image retrieval methods often focus on mining for larger sized geographical landmarks, and/or require extra labeled data, such as images/image-pairs with similar objects, for mining images with generic objects. We propose a fast and robust template matching algorithm in the DNN feature space, that retrieves semantically similar images at the object-level from a large unlabeled pool of data. We project the region(s) around the OOI in the query image to the DNN feature space for use as the template. This enables our method to focus on the semantics of the OOI without requiring extra labeled data. In the context of autonomous driving, we evaluate our system for targeted selection by using failure cases of object detectors as OOI. We demonstrate its efficacy on a large unlabeled dataset with 2.2M images and show high recall in mining for images with small-sized OOI. We compare our method against a well-known semantic image retrieval method, which also does not require extra labeled data. Lastly, we show that our method is flexible and retrieves images with one or more semantically different co-occurring OOI seamlessly.

* In Proceedings of the Intelligent Vehicles Symposium, IV 2022

Knowledge Distillation for 6D Pose Estimation by Keypoint Distribution Alignment

May 30, 2022

Shuxuan Guo, Yinlin Hu, Jose M. Alvarez, Mathieu Salzmann

Abstract:Knowledge distillation facilitates the training of a compact student network by using a deep teacher network. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method for 6D pose estimation. Specifically, we follow a standard approach to 6D pose estimation, consisting of predicting the 2D image locations of object keypoints. In this context, we observe that the compact student network struggles to predict precise 2D keypoint locations. To address this, instead of training the student with keypoint-to-keypoint supervision, we introduce a strategy based on optimal transport theory that distills the teacher's keypoint distribution into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models.

Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo

May 08, 2022

Jiayu Yang, Jose M. Alvarez, Miaomiao Liu

Abstract:Recent cost volume pyramid based deep neural networks have unlocked the potential of efficiently leveraging high-resolution images for depth inference from multi-view stereo. In general, those approaches assume that the depth of each pixel follows a unimodal distribution. Boundary pixels, however, usually follow a multi-modal distribution as they represent different depths; the assumption therefore leads to erroneous depth predictions at the coarser level of the cost volume pyramid that cannot be corrected in the refinement levels. In contrast, we propose constructing the cost volume by non-parametric depth distribution modeling to handle pixels with unimodal and multi-modal distributions. Our approach outputs multiple depth hypotheses at the coarser level to avoid errors in the early stage. As we perform local search around these multiple hypotheses in subsequent levels, our approach does not maintain the rigid depth spatial ordering and, therefore, we introduce a sparse cost aggregation network to derive information within each volume. We evaluate our approach extensively on two benchmark datasets: DTU and Tanks & Temples. Our experimental results show that our model outperforms existing methods by a large margin and achieves superior performance on boundary regions. Code is available at https://github.com/NVlabs/NP-CVP-MVSNet

* CVPR 2022

Understanding The Robustness in Vision Transformers

Apr 27, 2022

Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez

Abstract:Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicate that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code will be available at https://github.com/NVlabs/FAN.

M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

Apr 19, 2022

Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

Abstract:In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's-Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works, which process detection and segmentation separately, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. Such a BEV representation is important as it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of a voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks where depth information is missing. M$^2$BEV is memory efficient, allowing significantly higher resolution images as input, with faster inference speed. Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.

* Tech Report

FreeSOLO: Learning to Segment Objects without Annotations

Feb 24, 2022

Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, Jose M. Alvarez

Abstract:Instance segmentation is a fundamental vision task that aims to recognize and segment each object in an image. However, it requires costly annotations such as bounding boxes and segmentation masks for learning. In this work, we propose a fully unsupervised learning method that learns class-agnostic instance segmentation without any annotations. We present FreeSOLO, a self-supervised instance segmentation framework built on top of the simple instance segmentation method SOLO. Our method also presents a novel localization-aware pre-training framework, where objects can be discovered from complicated scenes in an unsupervised manner. FreeSOLO achieves 9.8% AP$_{50}$ on the challenging COCO dataset, which even outperforms several segmentation proposal methods that use manual annotations. For the first time, we demonstrate unsupervised class-agnostic instance segmentation successfully. FreeSOLO's box localization significantly outperforms state-of-the-art unsupervised object detection/discovery methods, with about 100% relative improvements in COCO AP. FreeSOLO further demonstrates superiority as a strong pre-training method, outperforming state-of-the-art self-supervised pre-training methods by +9.8% AP when fine-tuning instance segmentation with only 5% COCO masks.

* 13 pages

Fairness implications of encoding protected categorical attributes

Jan 27, 2022

Carlos Mougan, Jose M. Alvarez, Gourab K Patro, Salvatore Ruggieri, Steffen Staab

Abstract:Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount as it determines the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination, and the second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can improve the induced bias while encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when mixing two protected categorical features, leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications on fairness as it can increase both types of induced bias.

* 22 pages
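
The two encoders compared in the abstract can be sketched in a few lines. The target encoder below uses additive smoothing toward the global mean as one common form of regularization; whether this matches the regularizers studied in the paper is an assumption, as are the function names, the strength parameter `m`, and the toy data.

```python
from collections import defaultdict


def one_hot(categories):
    """One-hot encode a list of category labels; returns (rows, ordered levels)."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels


def smoothed_target_encoding(categories, targets, m=0.0):
    """Map each category to a smoothed mean of its targets.

    m is a hypothetical smoothing strength: m = 0 gives the raw per-category
    mean, while larger m pulls small categories toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}


# Toy data: group "b" has a single observation, so its raw mean is unreliable.
groups = ["a", "a", "a", "a", "b"]
labels = [1, 0, 1, 0, 1]
print(one_hot(groups)[1])
print(smoothed_target_encoding(groups, labels, m=0.0))
print(smoothed_target_encoding(groups, labels, m=4.0))
```

With `m = 0` the rarely seen group is encoded by its single label, which illustrates the variance problem the abstract calls reducible bias. Increasing `m` shrinks that estimate toward the global mean at the cost of blurring genuine group differences.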
