Andrew Howard

ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation

Jun 29, 2023
Shuyang Sun, Weijun Wang, Qihang Yu, Andrew Howard, Philip Torr, Liang-Chieh Chen

This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment. We observe that, due to the high complexity of the panoptic segmentation training objective, training inevitably penalizes false positives much more heavily. This unbalanced loss makes training end-to-end mask-transformer-based architectures difficult, especially for efficient models. In this paper, we present ReMaX, which adds relaxation to mask predictions and class predictions during training for panoptic segmentation. We demonstrate that with these simple relaxation techniques during training, our model can be consistently improved by a clear margin without any extra computational cost at inference. By combining our method with efficient backbones like MobileNetV3-Small, we achieve new state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K, and Cityscapes. Code and pre-trained checkpoints will be available at https://github.com/google-research/deeplab2.
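
A minimal sketch of the relaxation idea described above, in PyTorch: down-weighting the loss on pixels outside the ground-truth mask so that false positives are penalized less harshly during training. The loss form, the `relax` weight, and the function name are illustrative assumptions, not ReMaX's exact formulation.

```python
import torch
import torch.nn.functional as F

def relaxed_mask_loss(pred_logits, gt_mask, relax=0.5):
    """pred_logits, gt_mask: (N, H, W); gt_mask is binary {0, 1}.

    relax < 1 down-weights pixels outside the ground-truth mask, softening
    the false-positive penalty; relax = 1 recovers the plain per-pixel
    binary cross-entropy.
    """
    per_pixel = F.binary_cross_entropy_with_logits(
        pred_logits, gt_mask.float(), reduction="none")
    # Full penalty on positives, scaled-down penalty on negatives.
    weights = torch.where(gt_mask.bool(),
                          torch.ones_like(per_pixel),
                          torch.full_like(per_pixel, relax))
    return (weights * per_pixel).mean()
```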

On Label Granularity and Object Localization

Jul 20, 2022
Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, Oisin Mac Aodha

Weakly supervised object localization (WSOL) aims to learn representations that encode object location using only image-level category labels. However, many objects can be labeled at different levels of granularity. Is it an animal, a bird, or a great horned owl? Which image-level labels should we use? In this paper, we study the role of label granularity in WSOL. To facilitate this investigation, we introduce iNatLoc500, a new large-scale fine-grained benchmark dataset for WSOL. Surprisingly, we find that choosing the right training label granularity provides a much larger performance boost than choosing the best WSOL algorithm. We also show that changing the label granularity can significantly improve data efficiency.
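
For illustration, a small hedged sketch of what "choosing the training label granularity" can mean in practice: re-mapping fine-grained labels to coarser ancestors before training. The taxonomy entries and function name here are invented; iNatLoc500 provides the real species-level labels and hierarchy.

```python
def coarsen_labels(labels, parent):
    """Map each fine-grained label to its ancestor at the target level."""
    return [parent.get(lbl, lbl) for lbl in labels]

# Hypothetical taxonomy entries; the real hierarchy comes with the dataset.
species_to_family = {
    "great horned owl": "true owls",
    "barn owl": "barn owls",
}
fine = ["great horned owl", "barn owl"]
print(coarsen_labels(fine, species_to_family))  # ['true owls', 'barn owls']
```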

* ECCV 2022 

MOSAIC: Mobile Segmentation via decoding Aggregated Information and encoded Context

Dec 22, 2021
Weijun Wang, Andrew Howard

We present MOSAIC, a next-generation neural network architecture for efficient and accurate semantic image segmentation on mobile devices. MOSAIC is built from neural operations commonly supported across diverse mobile hardware platforms, enabling flexible deployment. With a simple asymmetric encoder-decoder structure, consisting of an efficient multi-scale context encoder and a light-weight hybrid decoder that recovers spatial details from aggregated information, MOSAIC achieves new state-of-the-art performance while balancing accuracy and computational cost. Deployed on top of a tailored feature extraction backbone based on a searched classification network, MOSAIC achieves a 5% absolute accuracy gain over the current industry-standard MLPerf models and state-of-the-art architectures.
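
A hedged sketch of the asymmetric encoder-decoder shape described above, in PyTorch: a multi-scale context encoder that pools at several scales over a low-resolution feature map, and a light-weight decoder that fuses one higher-resolution skip feature. The pooling scales, layer widths, and class names are illustrative, not MOSAIC's actual operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleContext(nn.Module):
    """Encoder head: pool at several scales, then fuse (scales illustrative)."""
    def __init__(self, ch, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in scales)
        self.project = nn.Conv2d(ch * len(scales), ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for s, conv in zip(self.scales, self.branches):
            y = F.avg_pool2d(x, s) if s > 1 else x
            y = F.interpolate(conv(y), size=(h, w), mode="bilinear",
                              align_corners=False)
            feats.append(y)
        return self.project(torch.cat(feats, dim=1))

class LightDecoder(nn.Module):
    """Decoder: upsample context, fuse one higher-resolution skip feature."""
    def __init__(self, ch, skip_ch, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(ch + skip_ch, ch, 3, padding=1)
        self.head = nn.Conv2d(ch, num_classes, 1)

    def forward(self, context, skip):
        context = F.interpolate(context, size=skip.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.head(F.relu(self.fuse(torch.cat([context, skip], 1))))

ctx = MultiScaleContext(64)(torch.randn(1, 64, 16, 16))
out = LightDecoder(64, 32, num_classes=21)(ctx, torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 21, 64, 64])
```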

Bridging the Gap Between Object Detection and User Intent via Query-Modulation

Jun 18, 2021
Marco Fornoni, Chaochao Yan, Liangchen Luo, Kimberly Wilber, Alex Stark, Yin Cui, Boqing Gong, Andrew Howard

When interacting with objects through cameras or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as the lack of a high-confidence detection on the object of interest, or a detection with the wrong class label. In this paper, we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve both query-modulated detection and standard object detection.
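
One simple way to modulate a detector with a query embedding is a FiLM-style per-channel scale and shift, sketched below in PyTorch. The paper studies query modulation in general; this particular mechanism and its dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QueryModulation(nn.Module):
    """Condition detector feature maps on a query embedding."""
    def __init__(self, query_dim, feat_ch):
        super().__init__()
        self.to_scale = nn.Linear(query_dim, feat_ch)
        self.to_shift = nn.Linear(query_dim, feat_ch)

    def forward(self, feats, query_emb):
        # feats: (N, C, H, W); query_emb: (N, query_dim)
        scale = self.to_scale(query_emb)[:, :, None, None]
        shift = self.to_shift(query_emb)[:, :, None, None]
        return feats * (1 + scale) + shift  # per-channel scale and shift

mod = QueryModulation(query_dim=64, feat_ch=128)
out = mod(torch.randn(2, 128, 32, 32), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 128, 32, 32])
```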

BasisNet: Two-stage Model Synthesis for Efficient Inference

May 07, 2021
Mingda Zhang, Chun-Te Chu, Andrey Zhmoginov, Andrew Howard, Brendan Jou, Yukun Zhu, Li Zhang, Rebecca Hwa, Adriana Kovashka

In this work, we present BasisNet, which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach incorporates a lightweight model that previews the input and generates input-dependent combination coefficients, which later control the synthesis of a more accurate specialist model that makes the final prediction. The two-stage model synthesis strategy can be applied to any network architecture, and both stages are trained jointly. We also show that proper training recipes are critical for increasing the generalizability of such high-capacity neural networks. On the ImageNet classification benchmark, BasisNet with MobileNets as the backbone demonstrates a clear advantage in the accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtains 80.3% top-1 accuracy with only 290M Multiply-Add operations, halving the computational cost of the previous state of the art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining 80.0% accuracy on ImageNet.
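
A hedged sketch of the two-stage synthesis in PyTorch: a tiny preview model produces combination coefficients that mix a bank of basis kernels into the weights of the specialist convolution. The shapes, the softmax over coefficients, and the batch-averaged mixture are simplifications for illustration, not BasisNet's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisConv(nn.Module):
    """Convolution whose kernel is an input-dependent mix of basis kernels."""
    def __init__(self, in_ch, out_ch, k=3, num_bases=4):
        super().__init__()
        self.bases = nn.Parameter(
            torch.randn(num_bases, out_ch, in_ch, k, k) * 0.05)
        self.k = k

    def forward(self, x, coeffs):
        # coeffs: (num_bases,) -> one synthesized kernel of shape (O, I, k, k)
        weight = torch.einsum("b,boixy->oixy", coeffs, self.bases)
        return F.conv2d(x, weight, padding=self.k // 2)

class BasisNetSketch(nn.Module):
    def __init__(self, in_ch=3, out_ch=16, num_bases=4):
        super().__init__()
        # Stage 1: lightweight preview model -> combination coefficients.
        self.preview = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_bases))
        # Stage 2: specialist layer synthesized from the basis kernels.
        self.basis_conv = BasisConv(in_ch, out_ch, num_bases=num_bases)

    def forward(self, x):
        # Simplification: one mixture per batch (coefficients averaged);
        # the real system conditions on each input individually.
        coeffs = torch.softmax(self.preview(x).mean(0), dim=0)
        return self.basis_conv(x, coeffs)

net = BasisNetSketch()
print(net(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```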

* To appear at the 4th Workshop on Efficient Deep Learning for Computer Vision (ECV 2021), a CVPR 2021 workshop 

SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection

Jan 04, 2021
Keren Ye, Adriana Kovashka, Mark Sandler, Menglong Zhu, Andrew Howard, Marco Fornoni

Deep learning based object detectors are commonly deployed on mobile devices to solve a variety of tasks. For maximum accuracy, each detector is usually trained to solve one single specific task and comes with a completely independent set of parameters. While this guarantees high performance, it is also highly inefficient, as each model has to be separately downloaded and stored. In this paper we address the question: can task-specific detectors be trained and represented as a shared set of weights, plus a very small set of additional weights for each task? The main contributions of this paper are the following: 1) we perform the first systematic study of parameter-efficient transfer learning techniques for object detection problems; 2) we propose a technique to learn a model patch whose size depends on the difficulty of the task to be learned, and validate our approach on 10 different object detection tasks. Our approach achieves similar accuracy to previously proposed approaches, while being significantly more compact.
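
A hedged sketch of the shared-weights-plus-patch idea in PyTorch: the shared layer is frozen, and each task trains only a small adapter whose parameters form the downloadable per-task patch. The identity-initialized 1x1 adapter is an illustrative stand-in, not SpotPatch's actual patch parameterization.

```python
import torch
import torch.nn as nn

def add_task_patch(shared_conv, ch):
    """Freeze a shared conv and attach a small trainable 1x1 adapter."""
    for p in shared_conv.parameters():
        p.requires_grad = False          # shared weights: downloaded once
    adapter = nn.Conv2d(ch, ch, kernel_size=1)  # the per-task patch
    nn.init.dirac_(adapter.weight)       # start as an identity mapping
    nn.init.zeros_(adapter.bias)
    return nn.Sequential(shared_conv, adapter)

shared = nn.Conv2d(32, 32, 3, padding=1)
patched = add_task_patch(shared, ch=32)
out = patched(torch.randn(1, 32, 8, 8))
# Only the adapter's parameters need to ship with each new task.
patch_size = sum(p.numel() for p in patched.parameters() if p.requires_grad)
print(patch_size)  # 32*32 + 32 = 1056 trainable parameters
```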

* Accepted to ACCV 2020 (Oral) 

Large-Scale Generative Data-Free Distillation

Dec 10, 2020
Liangchen Luo, Mark Sandler, Zi Lin, Andrey Zhmoginov, Andrew Howard

Knowledge distillation is one of the most popular and effective techniques for knowledge transfer, model compression, and semi-supervised learning. Most existing distillation approaches require access to the original or augmented training samples, which can be problematic in practice due to privacy, proprietary, and availability concerns. Recent work has put forward some methods to tackle this problem, but they are either highly time-consuming or unable to scale to large datasets. To this end, we propose a new method to train a generative image model by leveraging the statistics stored in the normalization layers of the trained teacher network. This enables us to build an ensemble of generators, without training data, that can efficiently produce substitute inputs for subsequent distillation. The proposed method pushes data-free distillation performance on CIFAR-10 and CIFAR-100 to 95.02% and 77.02%, respectively. Furthermore, we are able to scale it to the ImageNet dataset, which, to the best of our knowledge, has never been done with generative models in a data-free setting.
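
A hedged sketch of the core training signal in PyTorch: penalize a generator's images by how far the teacher's batch statistics fall from the running statistics stored in its BatchNorm layers. The exact loss form is illustrative of this family of methods, not necessarily the paper's objective.

```python
import torch
import torch.nn as nn

def bn_statistics_loss(teacher, fake_images):
    """Moment-matching penalty summed over the teacher's BatchNorm layers.

    The teacher should be frozen and in eval() mode so its running
    statistics stay fixed while gradients flow back to the generator.
    """
    losses, hooks = [], []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(((mean - bn.running_mean) ** 2).mean()
                          + ((var - bn.running_var) ** 2).mean())
        return hook

    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(make_hook(m)))
    teacher(fake_images)
    for h in hooks:
        h.remove()
    return torch.stack(losses).sum()

teacher = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).eval()
print(bn_statistics_loss(teacher, torch.randn(4, 3, 16, 16)))
```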

Multi-path Neural Networks for On-device Multi-domain Visual Classification

Oct 10, 2020
Qifei Wang, Junjie Ke, Joshua Greaves, Grace Chu, Gabriel Bender, Luciano Sbaiz, Alec Go, Andrew Howard, Feng Yang, Ming-Hsuan Yang, Jeff Gilbert, Peyman Milanfar

Learning multiple domains/tasks with a single model is important for improving data efficiency and lowering inference cost for numerous vision tasks, especially on resource-constrained mobile devices. However, hand-crafting a multi-domain/task model can be both tedious and challenging. This paper proposes a novel approach to automatically learn a multi-path network for multi-domain visual classification on mobile devices. The multi-path network is learned via neural architecture search, applying one reinforcement learning controller per domain to select the best path through a super-network created from a MobileNetV3-like search space. An adaptive balanced domain prioritization algorithm is proposed to balance the joint optimization across multiple domains. The resulting multi-path model selectively shares parameters across domains in shared nodes while keeping domain-specific parameters in non-shared nodes along each domain's path. This approach effectively reduces the total number of parameters and FLOPs, encouraging positive knowledge transfer while mitigating negative interference across domains. Extensive evaluations on the Visual Decathlon dataset demonstrate that the proposed multi-path model achieves state-of-the-art performance in terms of accuracy, model size, and FLOPs against other approaches using MobileNetV3-like architectures. Furthermore, the proposed method improves average accuracy over individually trained single-domain models, and reduces the total number of parameters and FLOPs by 78% and 32%, respectively, compared to simply bundling single-domain models for multi-domain learning.
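
For illustration, a small sketch of the path-sharing structure that results from the search: each domain selects one block per layer of the super-network, and blocks selected by multiple domains share parameters. The domains and paths below are hard-coded stand-ins for what the RL controllers would choose.

```python
# One candidate block index per layer of a 4-layer super-network.
paths = {
    "flowers":  [0, 1, 1, 2],
    "aircraft": [0, 1, 3, 0],
    "textures": [0, 2, 1, 2],
}

for layer in range(4):
    chosen = {domain: path[layer] for domain, path in paths.items()}
    # Blocks picked by more than one domain share parameters at this layer.
    counts = {}
    for b in chosen.values():
        counts[b] = counts.get(b, 0) + 1
    shared = {b for b, c in counts.items() if c > 1}
    print(f"layer {layer}: choices={chosen}, shared={shared or 'none'}")
```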

Discovering Multi-Hardware Mobile Models via Architecture Search

Aug 18, 2020
Grace Chu, Okan Arikan, Gabriel Bender, Weijun Wang, Achille Brighton, Pieter-Jan Kindermans, Hanxiao Liu, Berkin Akin, Suyog Gupta, Andrew Howard

Developing efficient models for mobile phones or other on-device deployments has been a popular topic in both industry and academia. In such scenarios, it is often convenient to deploy the same model on a diverse set of hardware devices owned by different end users to minimize the costs of development, deployment, and maintenance. Despite its importance, designing a single neural network that performs well on multiple devices is difficult, as each device has its own specialties and restrictions: a model optimized for one device may not perform well on another. While most existing work proposes different models optimized for each individual hardware target, this paper is the first to explore the problem of finding a single model that performs well on multiple hardware targets. Specifically, we leverage architecture search: given a set of diverse hardware to optimize for, we first introduce a multi-hardware search space that is compatible with all examined hardware. Then, to measure the performance of a neural network over multiple hardware targets, we propose metrics that characterize overall latency in the average case and the worst case. Applying the multi-hardware search space and the new metrics to Pixel4 CPU, GPU, DSP, and EdgeTPU, we find models that perform on par with or better than state-of-the-art (SOTA) models on each of our target accelerators and generalize well to many un-targeted hardware platforms. Compared with single-hardware searches, multi-hardware search gives a better trade-off between computation cost and model performance.
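
The two proposed metrics reduce to simple aggregates over per-device latency measurements; below is a sketch with invented numbers (the paper's actual scoring may normalize per device).

```python
# Invented per-device latency measurements for one candidate model (ms).
latency_ms = {"Pixel4-CPU": 21.0, "Pixel4-GPU": 9.5,
              "Pixel4-DSP": 7.2, "EdgeTPU": 5.8}

avg_case = sum(latency_ms.values()) / len(latency_ms)   # average-case metric
worst_case = max(latency_ms.values())                   # worst-case metric
print(f"average-case: {avg_case:.1f} ms, worst-case: {worst_case:.1f} ms")
```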

Non-discriminative data or weak model? On the relative importance of data and model resolution

Oct 17, 2019
Mark Sandler, Jonathan Baccash, Andrey Zhmoginov, Andrew Howard

We explore how the resolution of the input image ("input resolution") affects the performance of a neural network compared to the resolution of the hidden layers ("internal resolution"). Adjusting these characteristics is a frequently used hyperparameter that trades off model speed against accuracy. An intuitive interpretation is that the reduced information content of a low-resolution input causes the decay in accuracy. In this paper, we show that, up to a point, the input resolution alone plays little role in network performance; it is the internal resolution that is the critical driver of model quality. We then build on these insights to develop novel neural network architectures that we call Isometric Neural Networks. These models maintain a fixed internal resolution throughout their entire depth. We demonstrate that they yield high-accuracy models with a low activation footprint and parameter count.
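
A hedged sketch of what a fixed-internal-resolution network can look like in PyTorch: space-to-depth immediately reduces the input to the internal resolution, and every subsequent block runs at that same resolution. The depth, width, and downsampling factor are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def isometric_net(num_classes=1000, block_size=8, width=256, depth=6):
    in_ch = 3 * block_size * block_size  # channels after space-to-depth
    layers = [nn.PixelUnshuffle(block_size),        # space-to-depth
              nn.Conv2d(in_ch, width, 1)]
    for _ in range(depth):                          # constant resolution
        layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(width, num_classes)]
    return nn.Sequential(*layers)

net = isometric_net()
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```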

* ICCV 2019 Workshop on Real-World Recognition from Low-Quality Images and Videos 