Although unsupervised domain adaptation methods have been widely adopted across several computer vision tasks, it is more desirable if we can exploit a few labeled data from new domains encountered in a real application. The novel setting of the semi-supervised domain adaptation (SSDA) problem shares the challenges with the domain adaptation problem and the semi-supervised learning problem. However, a recent study shows that conventional domain adaptation and semi-supervised learning methods often result in less effective or negative transfer in the SSDA problem. In order to interpret the observation and address the SSDA problem, in this paper, we raise the intra-domain discrepancy issue within the target domain, which has never been discussed so far. Then, we demonstrate that addressing the intra-domain discrepancy leads to the ultimate goal of the SSDA problem. We propose an SSDA framework that aims to align features via alleviation of the intra-domain discrepancy. Our framework mainly consists of three schemes, i.e., attraction, perturbation, and exploration. First, the attraction scheme globally minimizes the intra-domain discrepancy within the target domain. Second, we demonstrate the incompatibility of the conventional adversarial perturbation methods with SSDA. Then, we present a domain adaptive adversarial perturbation scheme, which perturbs the given target samples in a way that reduces the intra-domain discrepancy. Finally, the exploration scheme locally aligns features in a class-wise manner complementary to the attraction scheme by selectively aligning unlabeled target features complementary to the perturbation scheme. We conduct extensive experiments on domain adaptation benchmark datasets such as DomainNet, Office-Home, and Office. Our method achieves state-of-the-art performances on all datasets.
Visible-infrared person re-identification (VI-ReID) is an important task in night-time surveillance applications, since visible cameras are difficult to capture valid appearance information under poor illumination conditions. Compared to traditional person re-identification that handles only the intra-modality discrepancy, VI-ReID suffers from additional cross-modality discrepancy caused by different types of imaging systems. To reduce both intra- and cross-modality discrepancies, we propose a Hierarchical Cross-Modality Disentanglement (Hi-CMD) method, which automatically disentangles ID-discriminative factors and ID-excluded factors from visible-thermal images. We only use ID-discriminative factors for robust cross-modality matching without ID-excluded factors such as pose or illumination. To implement our approach, we introduce an ID-preserving person image generation network and a hierarchical feature learning module. Our generation network learns the disentangled representation by generating a new cross-modality image with different poses and illuminations while preserving a person's identity. At the same time, the feature learning module enables our model to explicitly extract the common ID-discriminative characteristic between visible-infrared images. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods on two VI-ReID datasets.
Most video person re-identification (re-ID) methods are mainly based on supervised learning, which requires laborious cross-camera ID labeling. Due to this limit, it is difficult to increase the number of cameras for constructing a large camera network. In this paper, we address the person ID labeling issue by presenting novel deep representation learning without ID information across multiple cameras. Specifically, our method consists of both inter- and intra camera feature learning techniques. We maximize feature distances between people within a camera. At the same time, considering each camera as a different domain, we apply domain adversarial learning across multiple camera views for minimizing camera domain discrepancy. To further enhance our approach, we propose person part-level adaptation to effectively perform multi-camera domain invariant feature learning at different spatial regions. We carry out comprehensive experiments on four public re-ID datasets (i.e., PRID-2011, iLIDS-VID, MARS, and Market1501). Our method outperforms state-of-the-art methods by a large margin of about 20% in terms of rank-1 accuracy on the large-scale MARS dataset.
In this paper, we introduce a self-supervised approach for video object segmentation without human labeled data.Specifically, we present Robust Pixel-level Matching Net-works (RPM-Net), a novel deep architecture that matches pixels between adjacent frames, using only color information from unlabeled videos for training. Technically, RPM-Net can be separated in two main modules. The embed-ding module first projects input images into high dimensional embedding space. Then the matching module with deformable convolution layers matches pixels between reference and target frames based on the embedding features.Unlike previous methods using deformable convolution, our matching module adopts deformable convolution to focus on similar features in spatio-temporally neighboring pixels.Our experiments show that the selective feature sampling improves the robustness to challenging problems in video object segmentation such as camera shake, fast motion, deformation, and occlusion. Also, we carry out comprehensive experiments on three public datasets (i.e., DAVIS-2017,SegTrack-v2, and Youtube-Objects) and achieve state-of-the-art performance on self-supervised video object seg-mentation. Moreover, we significantly reduce the performance gap between self-supervised and fully-supervised video object segmentation (41.0% vs. 52.5% on DAVIS-2017 validation set)
Thesedays, Convolutional Neural Networks are widely used in semantic segmentation. However, since CNN-based segmentation networks produce low-resolution outputs with rich semantic information, it is inevitable that spatial details (e.g., small bjects and fine boundary information) of segmentation results will be lost. To address this problem, motivated by a variational approach to image segmentation (i.e., level set theory), we propose a novel loss function called the level set loss which is designed to refine spatial details of segmentation results. To deal with multiple classes in an image, we first decompose the ground truth into binary images. Note that each binary image consists of background and regions belonging to a class. Then we convert level set functions into class probability maps and calculate the energy for each class. The network is trained to minimize the weighted sum of the level set loss and the cross-entropy loss. The proposed level set loss improves the spatial details of segmentation results in a time and memory efficient way. Furthermore, our experimental results show that the proposed loss function achieves better performance than previous approaches.
Deep learning-based object detectors have shown remarkable improvements. However, supervised learning-based methods perform poorly when the train data and the test data have different distributions. To address the issue, domain adaptation transfers knowledge from the label-sufficient domain (source domain) to the label-scarce domain (target domain). Self-training is one of the powerful ways to achieve domain adaptation since it helps class-wise domain adaptation. Unfortunately, a naive approach that utilizes pseudo-labels as ground-truth degenerates the performance due to incorrect pseudo-labels. In this paper, we introduce a weak self-training (WST) method and adversarial background score regularization (BSR) for domain adaptive one-stage object detection. WST diminishes the adverse effects of inaccurate pseudo-labels to stabilize the learning procedure. BSR helps the network extract discriminative features for target backgrounds to reduce the domain shift. Two components are complementary to each other as BSR enhances discrimination between foregrounds and backgrounds, whereas WST strengthen class-wise discrimination. Experimental results show that our approach effectively improves the performance of the one-stage object detection in unsupervised domain adaptation setting.
Deep learning-based semantic segmentation methods have an intrinsic limitation that training a model requires a large amount of data with pixel-level annotations. To address this challenging issue, many researchers give attention to unsupervised domain adaptation for semantic segmentation. Unsupervised domain adaptation seeks to adapt the model trained on the source domain to the target domain. In this paper, we introduce a self-ensembling technique, one of the successful methods for domain adaptation in classification. However, applying self-ensembling to semantic segmentation is very difficult because heavily-tuned manual data augmentation used in self-ensembling is not useful to reduce the large domain gap in the semantic segmentation. To overcome this limitation, we propose a novel framework consisting of two components, which are complementary to each other. First, we present a data augmentation method based on Generative Adversarial Networks (GANs), which is computationally efficient and effective to facilitate domain alignment. Given those augmented images, we apply self-ensembling to enhance the performance of the segmentation network on the target domain. The proposed method outperforms state-of-the-art semantic segmentation methods on unsupervised domain adaptation benchmarks.
To learn target discriminative representations, using pseudo-labels is a simple yet effective approach for unsupervised domain adaptation. However, the existence of false pseudo-labels, which may have a detrimental influence on learning target representations, remains a major challenge. To overcome this issue, we propose a pseudo-labeling curriculum based on a density-based clustering algorithm. Since samples with high density values are more likely to have correct pseudo-labels, we leverage these subsets to train our target network at the early stage, and utilize data subsets with low density values at the later stage. We can progressively improve the capability of our network to generate pseudo-labels, and thus these target samples with pseudo-labels are effective for training our model. Moreover, we present a clustering constraint to enhance the discriminative power of the learned target features. Our approach achieves state-of-the-art performance on three benchmarks: Office-31, imageCLEF-DA, and Office-Home.