Nojun Kwak

LSQ+: Improving low-bit quantization through learnable offsets and better initialization

Apr 20, 2020
Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, Nojun Kwak

Unlike ReLU, newer activation functions (such as Swish, H-swish, and Mish) that are frequently employed in popular efficient architectures can produce negative activation values, with skewed positive and negative ranges. Typical learnable quantization schemes [PACT, LSQ] assume unsigned quantization for activations and quantize all negative activations to zero, which leads to a significant loss in performance. Naively using signed quantization to accommodate these negative values requires an extra sign bit, which is expensive for low-bit (2-, 3-, 4-bit) quantization. To solve this problem, we propose LSQ+, a natural extension of LSQ, wherein we introduce a general asymmetric quantization scheme with trainable scale and offset parameters that can learn to accommodate the negative activations. Gradient-based learnable quantization schemes also commonly suffer from high instability or variance in the final training performance, and hence require a great deal of hyper-parameter tuning to reach satisfactory performance. LSQ+ alleviates this problem by using an MSE-based initialization scheme for the quantization parameters. We show that this initialization leads to significantly lower variance in final performance across multiple training runs. Overall, LSQ+ shows state-of-the-art results for EfficientNet and MixNet and also significantly outperforms LSQ for low-bit quantization of neural nets with Swish activations (e.g., a 1.8% gain with W4A4 quantization and up to a 5.6% gain with W2A2 quantization of EfficientNet-B0 on the ImageNet dataset). To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bit-widths.
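The effect of an offset on negative activations can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sample activations, the fixed scale and offset values, and the grid search used for the MSE-based initialization are all assumptions made for illustration, and the trainable (gradient-based) part of the scheme is omitted.

```python
import math


def quantize_asymmetric(x, scale, offset, n_bits):
    """Fake-quantize x onto an unsigned n_bits grid shifted by `offset`."""
    q_max = 2 ** n_bits - 1
    q = math.floor((x - offset) / scale + 0.5)   # round to nearest integer code
    q = max(0, min(q_max, q))                    # clamp to the representable range
    return q * scale + offset                    # map the code back to a real value


def quantization_mse(xs, scale, offset, n_bits):
    """Mean squared error between values and their quantized versions."""
    return sum((x - quantize_asymmetric(x, scale, offset, n_bits)) ** 2
               for x in xs) / len(xs)


def init_scale_by_mse(xs, offset, n_bits, steps=50):
    """Hypothetical MSE-based init: grid-search the scale for a fixed offset."""
    full_range_scale = (max(xs) - offset) / (2 ** n_bits - 1)
    candidates = [full_range_scale * k / steps for k in range(1, steps + 1)]
    return min(candidates, key=lambda s: quantization_mse(xs, s, offset, n_bits))


# Swish-like activations: a small negative lobe and a larger positive range.
activations = [-0.2, -0.1, 0.0, 0.5, 1.0, 1.2]
n_bits = 2  # four quantization levels

# Offset 0.0 reproduces plain unsigned quantization: negatives collapse to zero.
unsigned = [quantize_asymmetric(a, 0.5, 0.0, n_bits) for a in activations]
# A negative offset (assumed value) shifts the grid so negatives are kept.
asymmetric = [quantize_asymmetric(a, 0.5, -0.2, n_bits) for a in activations]

best_scale = init_scale_by_mse(activations, -0.2, n_bits)
```

With the same number of levels and no sign bit, the shifted grid keeps the negative values distinct from zero, which is the behavior the abstract attributes to the learnable offset.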

* Camera-ready for Joint Workshop on Efficient Deep Learning in Computer Vision, CVPR 2020

Class-Imbalanced Semi-Supervised Learning

Feb 17, 2020
Minsung Hyun, Jisoo Jeong, Nojun Kwak

Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulties of labeling and making full use of unlabeled data. However, SSL relies on the limiting assumption that the numbers of samples in different classes are balanced, and many SSL algorithms perform worse on datasets with an imbalanced class distribution. In this paper, we introduce the task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data. In doing so, we consider class imbalance in both labeled and unlabeled sets. First, we analyze existing SSL methods in imbalanced environments and examine how class imbalance affects them. Then we propose Suppressed Consistency Loss (SCL), a regularization method robust to class imbalance. Our method shows better performance than the conventional methods in the CISSL environment. In particular, the more severe the class imbalance and the smaller the size of the labeled data, the better our method performs.

* 16 pages

Feature-map-level Online Adversarial Knowledge Distillation

Feb 05, 2020
Inseop Chung, SeongUk Park, Jangho Kim, Nojun Kwak

Feature maps contain rich information about image intensity and spatial correlation. However, previous online knowledge distillation methods only utilize the class probabilities. Thus, in this paper, we propose an online knowledge distillation method that transfers not only the knowledge of the class probabilities but also that of the feature map, using the adversarial training framework. We train multiple networks simultaneously by employing discriminators to distinguish the feature map distributions of different networks. Each network has its corresponding discriminator, which classifies the feature map from its own network as fake and that of the other network as real. By training a network to fool the corresponding discriminator, it can learn the other network's feature map distribution. We show that our method performs better than conventional direct alignment methods such as L1 and is more suitable for online distillation. We also propose a novel cyclic learning scheme for training more than two networks together. We have applied our method to various network architectures on the classification task and observed a significant improvement in performance, especially when training a pair consisting of a small network and a large one.
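The real/fake labeling described above can be sketched for two networks as follows. This is a minimal illustration, not the paper's code: the discriminator is reduced to a scalar probability of "real", binary cross-entropy is an assumed choice of adversarial loss, and all function names are hypothetical. The cyclic scheme for more than two networks is not modeled.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one probability and one target in {0, 1}."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))


def discriminator_loss(d_on_own, d_on_other):
    """Loss for the discriminator attached to one network.

    d_on_own:   discriminator output on its own network's feature map (target: fake = 0)
    d_on_other: discriminator output on the other network's feature map (target: real = 1)
    """
    return bce(d_on_own, 0.0) + bce(d_on_other, 1.0)


def network_adversarial_loss(d_on_own):
    """Loss for the network: it wants its own feature map to be judged real."""
    return bce(d_on_own, 1.0)


# A discriminator that separates the two feature maps well has a low loss ...
good_discriminator = discriminator_loss(d_on_own=0.1, d_on_other=0.9)
bad_discriminator = discriminator_loss(d_on_own=0.9, d_on_other=0.1)

# ... while the network's loss drops as it fools its discriminator.
fooled = network_adversarial_loss(d_on_own=0.9)
not_fooled = network_adversarial_loss(d_on_own=0.1)
```

In an actual training loop these two losses would be minimized alternately, alongside the usual classification and class-probability distillation terms.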

SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder

Dec 09, 2019
Hyojin Park, Lars Lowe Sjösund, YoungJoon Yoo, Nicolas Monet, Jihwan Bang, Nojun Kwak

Designing a lightweight and robust portrait segmentation algorithm is an important task for a wide range of face applications. However, the problem has been considered a subset of the object segmentation problem and has received less attention in the semantic segmentation field. Portrait segmentation has its own unique requirements. First, because portrait segmentation is performed in the middle of the whole pipeline of many real-world applications, it requires extremely lightweight models. Second, there have not been any public datasets in this domain that contain a sufficient number of images with unbiased statistics. To solve the first problem, we introduce the new extremely lightweight portrait segmentation model SINet, containing an information blocking decoder and spatial squeeze modules. The information blocking decoder uses confidence estimates to recover local spatial information without spoiling global consistency. The spatial squeeze module uses multiple receptive fields to cope with various sizes of consistency in the image. To tackle the second problem, we propose a simple method to create additional portrait segmentation data which can improve accuracy on the EG1800 dataset. In our qualitative and quantitative analysis on the EG1800 dataset, we show that our method outperforms various existing lightweight segmentation models. Our method reduces the number of parameters from 2.1M to 86.9K (around 95.9% reduction), while keeping the accuracy within a 1% margin of the state-of-the-art portrait segmentation method. We also show that our model runs successfully on a real mobile device at 100.6 FPS. In addition, we demonstrate that our method can be used for general semantic segmentation on the Cityscapes dataset. The code and dataset are available at https://github.com/HYOJINPARK/ExtPortraitSeg .

* https://github.com/HYOJINPARK/ExtPortraitSeg. arXiv admin note: text overlap with arXiv:1908.03093
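As a quick check of the figures quoted in the abstract, the short calculation below recomputes the parameter reduction and converts the reported frame rate into a per-frame latency. The variable names are illustrative; only the three input numbers come from the abstract.

```python
baseline_params = 2.1e6   # parameters of the compared model (2.1M)
sinet_params = 86.9e3     # parameters of SINet (86.9K)
fps = 100.6               # reported frame rate on a mobile device

# Fraction of parameters removed relative to the baseline.
reduction = 1 - sinet_params / baseline_params

# Time budget per frame implied by the frame rate, in milliseconds.
frame_ms = 1000 / fps

print(f"parameter reduction: {reduction:.1%}")
print(f"latency per frame:   {frame_ms:.2f} ms")
```

The reduction comes out at roughly 95.9%, matching the abstract, and 100.6 FPS corresponds to just under 10 ms per frame.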

Mixture-Model-based Bounding Box Density Estimation for Object Detection

Nov 28, 2019
Jaeyoung Yoo, Geonseok Seo, Nojun Kwak

In this paper, we propose a new object detection model, Mixture-Model-based Object Detector (MMOD), that performs multi-object detection using a mixture model. Unlike previous studies, we use density estimation to deal with the multi-object detection task. MMOD captures the conditional distribution of bounding boxes for a given input image using a mixture model consisting of Gaussian and categorical distributions. For this purpose, we propose a method to extract object bounding boxes from a trained mixture model. In doing so, we also propose a new network structure and objective function for MMOD. Our proposed method is not trained by assigning a ground truth bounding box to a specific location on the network's output. Instead, the mixture components are automatically learned to represent the distribution of the bounding boxes through density estimation. Therefore, MMOD does not require a large number of anchors and does not incur the positive-negative imbalance problem. This not only benefits the detection performance but also enhances the inference speed without requiring additional processing. We applied MMOD to the Pascal VOC and MS COCO datasets, where it outperforms other state-of-the-art fast object detection methods in terms of detection performance and inference speed (38.7 AP at 39 ms per image on MS COCO without bells and whistles). Code will be available.
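The density-estimation view can be sketched as a negative log-likelihood over ground-truth boxes. This is an illustrative simplification, not the MMOD objective itself: it assumes diagonal Gaussians over four box coordinates, one categorical class distribution per component, and hand-written component parameters in place of network outputs. All names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def mixture_density(box, label, components):
    """p(box, label) under a mixture of (diagonal Gaussian x categorical) components."""
    total = 0.0
    for comp in components:
        box_density = 1.0
        for x, m, s in zip(box, comp["mean"], comp["std"]):
            box_density *= gaussian_pdf(x, m, s)
        total += comp["weight"] * box_density * comp["class_probs"][label]
    return total


def negative_log_likelihood(boxes, labels, components):
    """Sum of -log p(box, label) over all ground-truth objects in one image."""
    return -sum(math.log(mixture_density(b, l, components))
                for b, l in zip(boxes, labels))


def make_components(shift):
    """Two components whose means are offset from the ground truth by `shift`."""
    return [
        {"weight": 0.5, "mean": (0.2 + shift, 0.2 + shift, 0.4 + shift, 0.4 + shift),
         "std": (0.1, 0.1, 0.1, 0.1), "class_probs": (0.9, 0.1)},
        {"weight": 0.5, "mean": (0.6 + shift, 0.6 + shift, 0.9 + shift, 0.9 + shift),
         "std": (0.1, 0.1, 0.1, 0.1), "class_probs": (0.2, 0.8)},
    ]


# Two ground-truth boxes (x1, y1, x2, y2 in normalized coordinates) with class labels.
gt_boxes = [(0.2, 0.2, 0.4, 0.4), (0.6, 0.6, 0.9, 0.9)]
gt_labels = [0, 1]

nll_near = negative_log_likelihood(gt_boxes, gt_labels, make_components(0.0))
nll_far = negative_log_likelihood(gt_boxes, gt_labels, make_components(0.3))
```

No ground-truth box is assigned to a particular component here: the likelihood sums over all components, so whichever component explains a box best dominates the loss for it.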

* 10 pages, 5 figures

QKD: Quantization-aware Knowledge Distillation

Nov 28, 2019
Jangho Kim, Yash Bhalgat, Jinwon Lee, Chirag Patel, Nojun Kwak

Quantization and knowledge distillation (KD) methods are widely used to reduce the memory and power consumption of deep neural networks (DNNs), especially for resource-constrained edge devices. Although their combination is quite promising for meeting these requirements, it may not work as desired, mainly because the regularization effect of KD further diminishes the already reduced representation power of a quantized model. To address this shortcoming, we propose Quantization-aware Knowledge Distillation (QKD), wherein quantization and KD are carefully coordinated in three phases. First, the Self-studying (SS) phase fine-tunes a quantized low-precision student network without KD to obtain a good initialization. Second, the Co-studying (CS) phase trains a teacher to make it more quantization-friendly and powerful than a fixed teacher. Finally, the Tutoring (TU) phase transfers knowledge from the trained teacher to the student. We extensively evaluate our method on the ImageNet and CIFAR-10/100 datasets and present an ablation study on networks with both standard and depthwise-separable convolutions. The proposed QKD outperforms existing state-of-the-art methods (e.g., a 1.3% improvement on ResNet-18 with W4A4 and 2.6% on MobileNetV2 with W4A4). Additionally, QKD recovers the full-precision accuracy at as low as W3A3 quantization on ResNet and W6A6 quantization on MobileNetV2.
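The abstract does not spell out the distillation loss, so the sketch below uses the standard temperature-softened KD loss as a stand-in and shows only how the student's objective could differ between the phases. The function names, the temperature, the weighting `alpha`, and the phase labels as strings are assumptions; teacher training in the CS phase and quantization itself are not modeled.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard cross-entropy against a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hard-label loss blended with a softened teacher-matching term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    return (1 - alpha) * cross_entropy(student_logits, label) + alpha * distill


def student_loss(phase, student_logits, teacher_logits, label):
    """Student objective per phase: no KD in SS, KD in CS and TU (assumed)."""
    if phase == "SS":
        return cross_entropy(student_logits, label)
    return kd_loss(student_logits, teacher_logits, label)


student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.0, -1.0]
ss_value = student_loss("SS", student, teacher, label=0)
tu_value = student_loss("TU", student, teacher, label=0)
```

When student and teacher logits coincide, the distillation term vanishes and only the weighted hard-label loss remains.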

Tell Me What They're Holding: Weakly-supervised Object Detection with Transferable Knowledge from Human-object Interaction

Nov 19, 2019
Daesik Kim, Gyujeong Lee, Jisoo Jeong, Nojun Kwak

In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we mainly focus on HOI as the main context which can strongly supervise complex semantics in images. Therefore, we propose a novel module called RRPN (relational region proposal network), which outputs an object-localizing attention map using only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the transferred knowledge about the localization map from the trained RRPN, a new object detector can learn unseen objects in the target domain with weak verbal supervision of HOI and without bounding box annotations. Because the RRPN is designed as an add-on, we can apply it not only to object detection but also to other domains such as semantic segmentation. The experimental results on the HICO-DET dataset show that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on the HICO-DET and V-COCO datasets.

* AAAI 2020 Oral Camera Ready

FEED: Feature-level Ensemble for Knowledge Distillation

Sep 24, 2019
SeongUk Park, Nojun Kwak

Knowledge Distillation (KD) aims to transfer knowledge in a teacher-student framework by providing the predictions of the teacher network to the student network in the training stage, helping the student network generalize better. It can use either a teacher with high capacity or an ensemble of multiple teachers. However, the latter is not convenient when one wants to use feature-map-based distillation methods. As a solution, this paper proposes a versatile and powerful training algorithm named FEature-level Ensemble for knowledge Distillation (FEED), which aims to transfer the ensemble knowledge using multiple teacher networks. We introduce a couple of training algorithms that transfer ensemble knowledge to the student at the feature map level. Among the feature-map-based distillation methods, using several non-linear transformations in parallel to transfer the knowledge of the multiple teachers helps the student find more generalized solutions. We name this method parallel FEED, and experimental results on CIFAR-100 and ImageNet show that our method has clear performance enhancements without introducing any additional parameters or computations at test time. We also show the experimental results of sequentially feeding the teacher's information to the student, hence the name sequential FEED, and discuss the lessons obtained. Additionally, the empirical results on measuring the reconstruction errors at the feature map give hints for the enhancements.

* 7 pages
