Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bumsub Ham

Bi-directional Contrastive Learning for Domain Adaptive Semantic Segmentation

Jul 22, 2022

Geon Lee, Chanho Eom, Wonkyung Lee, Hyekang Park, Bumsub Ham

Figure 1 for Bi-directional Contrastive Learning for Domain Adaptive Semantic Segmentation

Figure 2 for Bi-directional Contrastive Learning for Domain Adaptive Semantic Segmentation

Figure 3 for Bi-directional Contrastive Learning for Domain Adaptive Semantic Segmentation

Figure 4 for Bi-directional Contrastive Learning for Domain Adaptive Semantic Segmentation

Abstract:We present a novel unsupervised domain adaptation method for semantic segmentation that generalizes a model trained with source images and corresponding ground-truth labels to a target domain. A key to domain adaptive semantic segmentation is to learn domain-invariant and discriminative features without target ground-truth labels. To this end, we propose a bi-directional pixel-prototype contrastive learning framework that minimizes intra-class variations of features for the same object class, while maximizing inter-class variations for different ones, regardless of domains. Specifically, our framework aligns pixel-level features and a prototype of the same object class in target and source images (i.e., positive pairs), respectively, sets them apart for different classes (i.e., negative pairs), and performs the alignment and separation processes toward the other direction with pixel-level features in the source image and a prototype in the target image. The cross-domain matching encourages domain-invariant feature representations, while the bidirectional pixel-prototype correspondences aggregate features for the same object class, providing discriminative features. To establish training pairs for contrastive learning, we propose to generate dynamic pseudo labels of target images using a non-parametric label transfer, that is, pixel-prototype correspondences across different domains. We also present a calibration method compensating class-wise domain biases of prototypes gradually during training.

* Accepted to ECCV 2022

Via

Access Paper or Ask Questions

OIMNet++: Prototypical Normalization and Localization-aware Learning for Person Search

Jul 21, 2022

Sanghoon Lee, Youngmin Oh, Donghyeon Baek, Junghyup Lee, Bumsub Ham

Figure 1 for OIMNet++: Prototypical Normalization and Localization-aware Learning for Person Search

Figure 2 for OIMNet++: Prototypical Normalization and Localization-aware Learning for Person Search

Figure 3 for OIMNet++: Prototypical Normalization and Localization-aware Learning for Person Search

Figure 4 for OIMNet++: Prototypical Normalization and Localization-aware Learning for Person Search

Abstract:We address the task of person search, that is, localizing and re-identifying query persons from a set of raw scene images. Recent approaches are typically built upon OIMNet, a pioneer work on person search, that learns joint person representations for performing both detection and person re-identification (reID) tasks. To obtain the representations, they extract features from pedestrian proposals, and then project them on a unit hypersphere with L2 normalization. These methods also incorporate all positive proposals, that sufficiently overlap with the ground truth, equally to learn person representations for reID. We have found that 1) the L2 normalization without considering feature distributions degenerates the discriminative power of person representations, and 2) positive proposals often also depict background clutter and person overlaps, which could encode noisy features to person representations. In this paper, we introduce OIMNet++ that addresses the aforementioned limitations. To this end, we introduce a novel normalization layer, dubbed ProtoNorm, that calibrates features from pedestrian proposals, while considering a long-tail distribution of person IDs, enabling L2 normalized person representations to be discriminative. We also propose a localization-aware feature learning scheme that encourages better-aligned proposals to contribute more in learning discriminative representations. Experimental results and analysis on standard person search benchmarks demonstrate the effectiveness of OIMNet++.

* Accepted to ECCV 2022

Via

Access Paper or Ask Questions

Video-based Person Re-identification with Spatial and Temporal Memory Networks

Aug 20, 2021

Chanho Eom, Geon Lee, Junghyup Lee, Bumsub Ham

Figure 1 for Video-based Person Re-identification with Spatial and Temporal Memory Networks

Figure 2 for Video-based Person Re-identification with Spatial and Temporal Memory Networks

Figure 3 for Video-based Person Re-identification with Spatial and Temporal Memory Networks

Figure 4 for Video-based Person Re-identification with Spatial and Temporal Memory Networks

Abstract:Video-based person re-identification (reID) aims to retrieve person videos with the same identity as a query person across multiple cameras. Spatial and temporal distractors in person videos, such as background clutter and partial occlusions over frames, respectively, make this task much more challenging than image-based person reID. We observe that spatial distractors appear consistently in a particular location, and temporal distractors show several patterns, e.g., partial occlusions occur in the first few frames, where such patterns provide informative cues for predicting which frames to focus on (i.e., temporal attentions). Based on this, we introduce a novel Spatial and Temporal Memory Networks (STMN). The spatial memory stores features for spatial distractors that frequently emerge across video frames, while the temporal memory saves attentions which are optimized for typical temporal patterns in person videos. We leverage the spatial and temporal memories to refine frame-level person representations and to aggregate the refined frame-level features into a sequence-level person representation, respectively, effectively handling spatial and temporal distractors in person videos. We also introduce a memory spread loss preventing our model from addressing particular items only in the memories. Experimental results on standard benchmarks, including MARS, DukeMTMC-VideoReID, and LS-VID, demonstrate the effectiveness of our method.

* International Conference on Computer Vision (ICCV) 2021

Via

Access Paper or Ask Questions

Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Aug 17, 2021

Hyunjong Park, Sanghoon Lee, Junghyup Lee, Bumsub Ham

Figure 1 for Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Figure 2 for Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Figure 3 for Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Figure 4 for Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences

Abstract:We address the problem of visible-infrared person re-identification (VI-reID), that is, retrieving a set of person images, captured by visible or infrared cameras, in a cross-modal setting. Two main challenges in VI-reID are intra-class variations across person images, and cross-modal discrepancies between visible and infrared images. Assuming that the person images are roughly aligned, previous approaches attempt to learn coarse image- or rigid part-level person representations that are discriminative and generalizable across different modalities. However, the person images, typically cropped by off-the-shelf object detectors, are not necessarily well-aligned, which distract discriminative person representation learning. In this paper, we introduce a novel feature learning framework that addresses these problems in a unified way. To this end, we propose to exploit dense correspondences between cross-modal person images. This allows to address the cross-modal discrepancies in a pixel-level, suppressing modality-related features from person representations more effectively. This also encourages pixel-wise associations between cross-modal local features, further facilitating discriminative feature learning for VI-reID. Extensive experiments and analyses on standard VI-reID benchmarks demonstrate the effectiveness of our approach, which significantly outperforms the state of the art.

* iccv 2021

Via

Access Paper or Ask Questions

Distance-aware Quantization

Aug 16, 2021

Dohyung kim, Junghyup Lee, Bumsub Ham

Figure 1 for Distance-aware Quantization

Figure 2 for Distance-aware Quantization

Figure 3 for Distance-aware Quantization

Figure 4 for Distance-aware Quantization

Abstract:We address the problem of network quantization, that is, reducing bit-widths of weights and/or activations to lighten network architectures. Quantization methods use a rounding function to map full-precision values to the nearest quantized ones, but this operation is not differentiable. There are mainly two approaches to training quantized networks with gradient-based optimizers. First, a straight-through estimator (STE) replaces the zero derivative of the rounding with that of an identity function, which causes a gradient mismatch problem. Second, soft quantizers approximate the rounding with continuous functions at training time, and exploit the rounding for quantization at test time. This alleviates the gradient mismatch, but causes a quantizer gap problem. We alleviate both problems in a unified framework. To this end, we introduce a novel quantizer, dubbed a distance-aware quantizer (DAQ), that mainly consists of a distance-aware soft rounding (DASR) and a temperature controller. To alleviate the gradient mismatch problem, DASR approximates the discrete rounding with the kernel soft argmax, which is based on our insight that the quantization can be formulated as a distance-based assignment problem between full-precision values and quantized ones. The controller adjusts the temperature parameter in DASR adaptively according to the input, addressing the quantizer gap problem. Experimental results on standard benchmarks show that DAQ outperforms the state of the art significantly for various bit-widths without bells and whistles.

* ICCV2021

Via

Access Paper or Ask Questions

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Aug 14, 2021

Donghyeon Baek, Youngmin Oh, Bumsub Ham

Figure 1 for Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Figure 2 for Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Figure 3 for Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Figure 4 for Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation

Abstract:We address the problem of generalized zero-shot semantic segmentation (GZS3) predicting pixel-wise semantic labels for seen and unseen classes. Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones (e.g., word2vec) to train novel classifiers for both seen and unseen classes. Although generative methods show decent performance, they have two limitations: (1) the visual features are biased towards seen classes; (2) the classifier should be retrained whenever novel unseen classes appear. We propose a discriminative approach to address these limitations in a unified framework. To this end, we leverage visual and semantic encoders to learn a joint embedding space, where the semantic encoder transforms semantic features to semantic prototypes that act as centers for visual features of corresponding classes. Specifically, we introduce boundary-aware regression (BAR) and semantic consistency (SC) losses to learn discriminative features. Our approach to exploiting the joint embedding space, together with BAR and SC terms, alleviates the seen bias problem. At test time, we avoid the retraining process by exploiting semantic prototypes as a nearest-neighbor (NN) classifier. To further alleviate the bias problem, we also propose an inference technique, dubbed Apollonius calibration (AC), that modulates the decision boundary of the NN classifier to the Apollonius circle adaptively. Experimental results demonstrate the effectiveness of our framework, achieving a new state of the art on standard benchmarks.

* Accepted to ICCV 2021

Via

Access Paper or Ask Questions

Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

Apr 02, 2021

Youngmin Oh, Beomjun Kim, Bumsub Ham

Figure 1 for Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

Figure 2 for Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

Figure 3 for Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

Figure 4 for Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

Abstract:We address the problem of weakly-supervised semantic segmentation (WSSS) using bounding box annotations. Although object bounding boxes are good indicators to segment corresponding objects, they do not specify object boundaries, making it hard to train convolutional neural networks (CNNs) for semantic segmentation. We find that background regions are perceptually consistent in part within an image, and this can be leveraged to discriminate foreground and background regions inside object bounding boxes. To implement this idea, we propose a novel pooling method, dubbed background-aware pooling (BAP), that focuses more on aggregating foreground features inside the bounding boxes using attention maps. This allows to extract high-quality pseudo segmentation labels to train CNNs for semantic segmentation, but the labels still contain noise especially at object boundaries. To address this problem, we also introduce a noise-aware loss (NAL) that makes the networks less susceptible to incorrect labels. Experimental results demonstrate that learning with our pseudo labels already outperforms state-of-the-art weakly- and semi-supervised methods on the PASCAL VOC 2012 dataset, and the NAL further boosts the performance.

* Accepted to CVPR 2021

Via

Access Paper or Ask Questions

Network Quantization with Element-wise Gradient Scaling

Apr 02, 2021

Junghyup Lee, Dohyung Kim, Bumsub Ham

Figure 1 for Network Quantization with Element-wise Gradient Scaling

Figure 2 for Network Quantization with Element-wise Gradient Scaling

Figure 3 for Network Quantization with Element-wise Gradient Scaling

Figure 4 for Network Quantization with Element-wise Gradient Scaling

Abstract:Network quantization aims at reducing bit-widths of weights and/or activations, particularly important for implementing deep neural networks with limited hardware resources. Most methods use the straight-through estimator (STE) to train quantized networks, which avoids a zero-gradient problem by replacing a derivative of a discretizer (i.e., a round function) with that of an identity function. Although quantized networks exploiting the STE have shown decent performance, the STE is sub-optimal in that it simply propagates the same gradient without considering discretization errors between inputs and outputs of the discretizer. In this paper, we propose an element-wise gradient scaling (EWGS), a simple yet effective alternative to the STE, training a quantized network better than the STE in terms of stability and accuracy. Given a gradient of the discretizer output, EWGS adaptively scales up or down each gradient element, and uses the scaled gradient as the one for the discretizer input to train quantized networks via backpropagation. The scaling is performed depending on both the sign of each gradient element and an error between the continuous input and discrete output of the discretizer. We adjust a scaling factor adaptively using Hessian information of a network. We show extensive experimental results on the image classification datasets, including CIFAR-10 and ImageNet, with diverse network architectures under a wide range of bit-width settings, demonstrating the effectiveness of our method.

* Accepted to CVPR 2021

Via

Access Paper or Ask Questions

HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Apr 02, 2021

Jongyoun Noh, Sanghoon Lee, Bumsub Ham

Figure 1 for HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Figure 2 for HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Figure 3 for HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Figure 4 for HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection

Abstract:We address the problem of 3D object detection, that is, estimating 3D object bounding boxes from point clouds. 3D object detection methods exploit either voxel-based or point-based features to represent 3D objects in a scene. Voxel-based features are efficient to extract, while they fail to preserve fine-grained 3D structures of objects. Point-based features, on the other hand, represent the 3D structures more accurately, but extracting these features is computationally expensive. We introduce in this paper a novel single-stage 3D detection method having the merit of both voxel-based and point-based features. To this end, we propose a new convolutional neural network (CNN) architecture, dubbed HVPR, that integrates both features into a single 3D representation effectively and efficiently. Specifically, we augment the point-based features with a memory module to reduce the computational cost. We then aggregate the features in the memory, semantically similar to each voxel-based one, to obtain a hybrid 3D representation in a form of a pseudo image, allowing to localize 3D objects in a single stage efficiently. We also propose an Attentive Multi-scale Feature Module (AMFM) that extracts scale-aware features considering the sparse and irregular patterns of point clouds. Experimental results on the KITTI dataset demonstrate the effectiveness and efficiency of our approach, achieving a better compromise in terms of speed and accuracy.

* Accepted to CVPR 2021

Via

Access Paper or Ask Questions

Learning with Privileged Information for Efficient Image Super-Resolution

Jul 15, 2020

Wonkyung Lee, Junghyup Lee, Dohyung Kim, Bumsub Ham

Figure 1 for Learning with Privileged Information for Efficient Image Super-Resolution

Figure 2 for Learning with Privileged Information for Efficient Image Super-Resolution

Figure 3 for Learning with Privileged Information for Efficient Image Super-Resolution

Figure 4 for Learning with Privileged Information for Efficient Image Super-Resolution

Abstract:Convolutional neural networks (CNNs) have allowed remarkable advances in single image super-resolution (SISR) over the last decade. Most SR methods based on CNNs have focused on achieving performance gains in terms of quality metrics, such as PSNR and SSIM, over classical approaches. They typically require a large amount of memory and computational units. FSRCNN, consisting of few numbers of convolutional layers, has shown promising results, while using an extremely small number of network parameters. We introduce in this paper a novel distillation framework, consisting of teacher and student networks, that allows to boost the performance of FSRCNN drastically. To this end, we propose to use ground-truth high-resolution (HR) images as privileged information. The encoder in the teacher learns the degradation process, subsampling of HR images, using an imitation loss. The student and the decoder in the teacher, having the same network architecture as FSRCNN, try to reconstruct HR images. Intermediate features in the decoder, affordable for the student to learn, are transferred to the student through feature distillation. Experimental results on standard benchmarks demonstrate the effectiveness and the generalization ability of our framework, which significantly boosts the performance of FSRCNN as well as other SR methods. Our code and model are available online: https://cvlab.yonsei.ac.kr/projects/PISR.

* ECCV-2020

Via

Access Paper or Ask Questions