Nanyang Technological University
Abstract:Despite the great success of GANs in image translation with different conditional inputs such as semantic segmentation and edge maps, generating high-fidelity realistic images with reference styles remains a grand challenge in conditional image-to-image translation. This paper presents a general image translation framework that incorporates optimal transport for feature alignment between conditional inputs and style exemplars in image translation. The introduction of optimal transport mitigates the constraint of many-to-one feature matching significantly while building up accurate semantic correspondences between conditional inputs and exemplars. We design a novel unbalanced optimal transport to address the transport between features with mismatched distributions, which exist widely between conditional inputs and exemplars. In addition, we design a semantic-activation normalization scheme that effectively injects style features of exemplars into the image translation process. Extensive experiments over multiple image translation tasks show that our method achieves superior image translation qualitatively and quantitatively as compared with the state-of-the-art.
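As a rough illustration of the feature-alignment idea, the sketch below (not the paper's implementation) computes an entropic unbalanced optimal transport plan between conditional-input features and exemplar features with KL-relaxed marginals; the cosine cost, the relaxation weight tau, and all tensor shapes are assumptions.

```python
# Minimal sketch of entropic unbalanced optimal transport (Sinkhorn-style
# scaling with KL-relaxed marginals) between two feature sets.
import torch

def unbalanced_sinkhorn(x, y, eps=0.05, tau=0.5, n_iters=50):
    """x: (N, C) conditional-input features, y: (M, C) exemplar features.
    Returns a soft correspondence (transport plan) of shape (N, M)."""
    x = torch.nn.functional.normalize(x, dim=1)
    y = torch.nn.functional.normalize(y, dim=1)
    cost = 1.0 - x @ y.t()                          # cosine distance as transport cost
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))   # uniform source marginal
    b = torch.full((y.size(0),), 1.0 / y.size(0))   # uniform target marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    lam = tau / (tau + eps)                         # exponent from KL marginal relaxation
    for _ in range(n_iters):
        u = (a / (K @ v + 1e-8)) ** lam
        v = (b / (K.t() @ u + 1e-8)) ** lam
    return u[:, None] * K * v[None, :]

# Example: warp exemplar style features to the spatial layout of the conditional input.
cond_feat, exem_feat = torch.randn(256, 64), torch.randn(300, 64)
T = unbalanced_sinkhorn(cond_feat, exem_feat)
warped_style = (T / (T.sum(dim=1, keepdim=True) + 1e-8)) @ exem_feat
```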
Abstract:Deep learning-based multi-source unsupervised domain adaptation (MUDA) has been actively studied in recent years. Compared with single-source unsupervised domain adaptation (SUDA), domain shift in MUDA exists not only between the source and target domains but also among multiple source domains. Most existing MUDA algorithms focus on extracting domain-invariant representations among all domains whereas the task-specific decision boundaries among classes are largely neglected. In this paper, we propose an end-to-end trainable network that exploits domain Consistency Regularization for unsupervised Multi-source domain Adaptive classification (CRMA). CRMA aligns not only the distributions of each pair of source and target domains but also those of all domains. For each pair of source and target domains, we employ an intra-domain consistency to regularize a pair of domain-specific classifiers to achieve intra-domain alignment. In addition, we design an inter-domain consistency that targets joint inter-domain alignment among all domains. To account for the varying similarities between the multiple source domains and the target domain, we design an authorization strategy that assigns different authorities to domain-specific classifiers adaptively for optimal pseudo-label prediction and self-training. Extensive experiments show that CRMA tackles unsupervised domain adaptation effectively under a multi-source setup and achieves superior adaptation consistently across multiple MUDA datasets.
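A minimal sketch of the consistency idea, assuming two domain-specific classifiers per source-target pair and a symmetric-KL consistency measure; both choices are assumptions rather than CRMA's exact formulation.

```python
# Sketch of intra-domain consistency (within each source-target pair) and
# inter-domain consistency (across pairs) on unlabeled target predictions.
import torch
import torch.nn.functional as F

def pair_consistency(logits_a, logits_b):
    """Symmetric KL between a pair's two domain-specific classifiers."""
    pa, pb = F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1)
    return 0.5 * (F.kl_div(pa.log(), pb, reduction="batchmean")
                  + F.kl_div(pb.log(), pa, reduction="batchmean"))

def crma_consistency_loss(per_source_logits):
    """per_source_logits: list over source domains; each entry is a tuple
    (logits_c1, logits_c2) produced by that pair's classifiers on target images."""
    intra = sum(pair_consistency(l1, l2) for l1, l2 in per_source_logits)
    # inter-domain term: averaged predictions of different pairs should agree
    means = [0.5 * (F.softmax(l1, 1) + F.softmax(l2, 1)) for l1, l2 in per_source_logits]
    inter = sum(F.mse_loss(means[i], means[j])
                for i in range(len(means)) for j in range(i + 1, len(means)))
    return intra + inter

loss = crma_consistency_loss([(torch.randn(4, 10), torch.randn(4, 10)) for _ in range(3)])
```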
Abstract:Unsupervised domain adaptation (UDA) aims to learn a well-performing model in an unlabeled target domain by leveraging labeled data from one or multiple related source domains. It remains a great challenge due to 1) the lack of annotations in the target domain and 2) the large discrepancy between the distributions of source and target data. We propose Spectral UDA (SUDA), an efficient yet effective UDA technique that works in the spectral space and is generic across different visual recognition tasks in detection, classification and segmentation. SUDA addresses UDA challenges from two perspectives. First, it mitigates inter-domain discrepancies by a spectrum transformer (ST) that maps source and target images into the spectral space and learns to enhance domain-invariant spectra while suppressing domain-variant spectra simultaneously. To this end, we design a novel adversarial multi-head spectrum attention that leverages contextual information to identify domain-variant and domain-invariant spectra effectively. Second, it mitigates the lack of annotations in the target domain by introducing multi-view spectral learning, which aims to learn comprehensive yet confident target representations by maximizing the mutual information among multiple ST augmentations that capture different spectral views of each target sample. Extensive experiments over different visual tasks (e.g., detection, classification and segmentation) show that SUDA achieves superior accuracy, and it is also complementary to state-of-the-art UDA methods, providing consistent performance boosts with little extra computation.
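A minimal sketch of the spectral manipulation, assuming a single-head gating variant: images are mapped to the spectral domain with an FFT, the amplitude spectrum is reweighted by a small learned attention net, and the result is mapped back. Module names and sizes are assumptions, not SUDA's spectrum transformer.

```python
# Sketch: learn a per-frequency gate that keeps some spectra and suppresses others.
import torch
import torch.nn as nn

class SpectrumAttention(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # small conv net predicting a per-frequency gate in [0, 1]
        self.gate = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        spec = torch.fft.fft2(x, norm="ortho")     # complex spectrum
        amp, pha = spec.abs(), spec.angle()        # amplitude and phase
        amp = amp * self.gate(amp)                 # enhance/suppress frequency components
        spec = torch.polar(amp, pha)               # recombine amplitude and phase
        return torch.fft.ifft2(spec, norm="ortho").real

x = torch.rand(2, 3, 64, 64)
x_st = SpectrumAttention()(x)                      # a spectrally transformed view of x
```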
Abstract:Instance contrast for unsupervised representation learning has achieved great success in recent years. In this work, we explore the idea of instance contrastive learning in unsupervised domain adaptation (UDA) and propose a novel Category Contrast technique (CaCo) that introduces semantic priors on top of instance discrimination for visual UDA tasks. By considering instance contrastive learning as a dictionary look-up operation, we construct a semantics-aware dictionary with samples from both source and target domains where each target sample is assigned a (pseudo) category label based on the category priors of source samples. This allows category contrastive learning (between target queries and the category-level dictionary) for category-discriminative yet domain-invariant feature representations: samples of the same category (from either source or target domain) are pulled closer while those of different categories are pushed apart simultaneously. Extensive UDA experiments in multiple visual tasks (e.g., segmentation, classification and detection) show that the simple implementation of CaCo achieves superior performance as compared with the highly-optimized state-of-the-art methods. Analytically and empirically, the experiments also demonstrate that CaCo is complementary to existing UDA methods and generalizable to other learning setups such as semi-supervised learning, unsupervised model adaptation, etc.
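A minimal sketch of category contrast, assuming the dictionary stores one L2-normalized key per (pseudo-)category built from both domains; the temperature value and the plain cross-entropy form are assumptions rather than CaCo's released implementation.

```python
# Sketch: contrast target queries against a category-level dictionary so that
# same-category samples are pulled together and different categories pushed apart.
import torch
import torch.nn.functional as F

def caco_loss(queries, labels, category_dict, temperature=0.07):
    """queries: (B, D) target features, labels: (B,) pseudo category labels,
    category_dict: (K, D) one key per category from source and target samples."""
    q = F.normalize(queries, dim=1)
    k = F.normalize(category_dict, dim=1)
    logits = q @ k.t() / temperature         # similarity of each query to every category key
    return F.cross_entropy(logits, labels)   # pull own category, push the rest

queries = torch.randn(8, 128)
dictionary = torch.randn(19, 128)            # e.g., 19 classes for segmentation (assumption)
loss = caco_loss(queries, torch.randint(0, 19, (8,)), dictionary)
```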
Abstract:Unsupervised domain adaptation (UDA) involves a supervised loss in a labeled source domain and an unsupervised loss in an unlabeled target domain, and it often faces more severe overfitting than classical supervised learning as the supervised source loss suffers from a clear domain gap and the unsupervised target loss is often noisy due to the lack of annotations. This paper presents RDA, a robust domain adaptation technique that introduces adversarial attacking to mitigate overfitting in UDA. We achieve robust domain adaptation by a novel Fourier adversarial attacking (FAA) method that allows perturbation noise of large magnitude while modifying image semantics only minimally; the large perturbations are critical to the effectiveness of the generated adversarial samples due to the existence of 'domain gaps'. Specifically, FAA decomposes images into multiple frequency components (FCs) and generates adversarial samples by perturbing only certain FCs that capture little semantic information. With FAA-generated samples, the training can continue the 'random walk' and drift into an area with a flat loss landscape, leading to more robust domain adaptation. Extensive experiments over multiple domain adaptation tasks show that RDA works with different computer vision tasks with superior performance.
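A rough sketch of a frequency-restricted attack in the spirit described above, assuming a classification head and that only the highest-frequency band is perturbed because it carries little semantics; the masking scheme and hyper-parameters are illustrative assumptions, not FAA's actual decomposition.

```python
# Sketch: craft a large-magnitude adversarial perturbation but keep only its
# high-frequency components before adding it back to the image.
import torch

def frequency_restricted_attack(model, images, labels, eps=0.3, keep_ratio=0.25):
    images = images.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    # move the sign perturbation into the spectral domain
    delta = torch.fft.fftshift(torch.fft.fft2(eps * grad.sign(), norm="ortho"), dim=(-2, -1))
    h, w = images.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).float().sqrt()
    mask = (radius > (1 - keep_ratio) * radius.max()).to(images.dtype)  # keep high frequencies only
    delta = torch.fft.ifft2(torch.fft.ifftshift(delta * mask, dim=(-2, -1)), norm="ortho").real
    return (images + delta).clamp(0, 1).detach()
```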
Abstract:Contemporary domain adaptive semantic segmentation aims to address data annotation challenges by assuming that target domains are completely unannotated. However, annotating a few target samples is usually manageable and worthwhile, especially if it improves the adaptation performance substantially. This paper presents SSDAS, a Semi-Supervised Domain Adaptive image Segmentation network that employs a few labeled target samples as anchors for adaptive and progressive feature alignment between labeled source samples and unlabeled target samples. We position the few labeled target samples as references that gauge the similarity between source and target features and guide adaptive inter-domain alignment for learning more similar source features. In addition, we replace dissimilar source features with high-confidence target features continuously during the iterative training process, which achieves progressive intra-domain alignment between confident and unconfident target features. Extensive experiments show that the proposed SSDAS greatly outperforms a number of baselines, including UDA-based semantic segmentation and SSDA-based image classification methods. In addition, SSDAS is complementary and can be easily incorporated into UDA-based methods with consistent improvements in domain adaptive semantic segmentation.
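A minimal sketch of anchor-guided similarity gauging, assuming one class-wise anchor prototype computed from the few labeled target samples; the cosine similarity measure and the threshold are assumptions rather than SSDAS's exact criterion.

```python
# Sketch: weight source features by their similarity to target anchors of the
# same class, so target-like source features dominate the alignment.
import torch
import torch.nn.functional as F

def anchor_guided_weights(src_feats, src_labels, anchors, threshold=0.5):
    """src_feats: (N, D) source features, src_labels: (N,) class labels,
    anchors: (K, D) class-wise prototypes from labeled target samples.
    Returns per-sample weights that emphasize target-like source features."""
    sim = F.cosine_similarity(src_feats, anchors[src_labels], dim=1)
    return (sim > threshold).float() * sim      # gate dissimilar features, weight the rest

feats = torch.randn(16, 256)
anchors = torch.randn(19, 256)                  # one anchor per semantic class (assumption)
weights = anchor_guided_weights(feats, torch.randint(0, 19, (16,)), anchors)
# in practice, multiply per-sample source losses by `weights` before averaging
```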
Abstract:Leveraging the advances of natural language processing, most recent scene text recognizers adopt an encoder-decoder architecture where text images are first converted to representative features and then to a sequence of characters via 'direct decoding'. However, scene text images suffer from various noises from different sources such as complex backgrounds and geometric distortions, which often confuse the decoder and lead to incorrect alignment of visual features at noisy decoding time steps. This paper presents I2C2W, a novel scene text recognizer that is accurate and tolerant to various noises in scenes. I2C2W consists of an image-to-character module (I2C) and a character-to-word module (C2W) which are complementary and can be trained end-to-end. I2C detects characters and predicts their relative positions in a word. It strives to detect all possible characters including incorrect and redundant ones based on different alignments of visual features without the restriction of time steps. Taking the detected characters as input, C2W learns from character semantics and their positions to filter out incorrect and redundant detections and produce the final word recognition. Extensive experiments over seven public datasets show that I2C2W achieves superior recognition performance and outperforms the state-of-the-art by large margins on challenging irregular scene text datasets.
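A minimal sketch of the character-to-word idea: detected characters and their predicted in-word positions are embedded as tokens and re-read by a small transformer encoder that can keep, re-order, or drop detections. The vocabulary size, depth, and maximum word length below are assumptions, not I2C2W's configuration.

```python
# Sketch of a C2W-style module operating on (character, position) tokens.
import torch
import torch.nn as nn

class C2W(nn.Module):
    def __init__(self, num_chars=37, max_len=25, dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=3)
        self.head = nn.Linear(dim, num_chars + 1)    # +1 class for "not a character"

    def forward(self, char_ids, positions):
        tokens = self.char_emb(char_ids) + self.pos_emb(positions)
        return self.head(self.encoder(tokens))       # per-slot character prediction

c2w = C2W()
chars = torch.randint(0, 37, (2, 10))                # candidate detections (possibly noisy)
pos = torch.randint(0, 25, (2, 10))                  # predicted relative positions in the word
word_logits = c2w(chars, pos)
```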
Abstract:Image inpainting is an underdetermined inverse problem; it naturally allows diverse contents that fill up the missing or corrupted regions reasonably and realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents with autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal as corrupted regions can have arbitrary shapes with contexts from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic image structures at low resolution by adapting transformers and then synthesizes realistic texture details at high resolution with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively and quantitatively.
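A minimal sketch of bidirectional autoregressive sampling: masked tokens of a low-resolution token map are filled in random order, and each step re-attends to the full bidirectional context. The stand-in backbone and token vocabulary are assumptions, not BAT's architecture.

```python
# Sketch: iteratively fill masked tokens, re-running a bidirectional model each step.
import torch
import torch.nn as nn

def bidirectional_ar_fill(transformer, tokens, mask):
    """tokens: (B, L) discrete tokens, mask: (B, L) bool, True = missing."""
    tokens, mask = tokens.clone(), mask.clone()
    while mask.any():
        logits = transformer(tokens)                        # (B, L, vocab), bidirectional context
        for b in range(tokens.size(0)):
            missing = mask[b].nonzero(as_tuple=True)[0]
            if len(missing) == 0:
                continue
            idx = missing[torch.randint(len(missing), (1,))]   # random fill order
            probs = logits[b, idx].softmax(-1)
            tokens[b, idx] = torch.multinomial(probs, 1).squeeze(-1)
            mask[b, idx] = False
    return tokens

# toy usage with an embedding + linear stand-in for the transformer backbone
vocab, L = 512, 16 * 16
toy = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))
tokens = torch.randint(0, vocab, (1, L))
mask = torch.zeros(1, L, dtype=torch.bool)
mask[:, 60:120] = True                                       # simulate a corrupted region
filled = bidirectional_ar_fill(toy, tokens, mask)
```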
Abstract:The prevalent approach in domain adaptive object detection adopts a two-stage architecture (Faster R-CNN) that involves a number of hyper-parameters and hand-crafted designs such as anchors, region pooling, non-maximum suppression, etc. Such an architecture makes it complicated to adopt certain existing domain adaptation methods that perform feature alignment in different ways. In this work, we adopt a one-stage detector and design DA-DETR, a simple yet effective domain adaptive object detection network that performs inter-domain alignment with a single discriminator. DA-DETR introduces a hybrid attention module that explicitly pinpoints hard-to-align features for simple yet effective alignment across domains. It greatly simplifies traditional domain adaptation pipelines by eliminating sophisticated routines that involve multiple adversarial learning frameworks with different types of features. Despite its simplicity, extensive experiments show that DA-DETR achieves superior accuracy as compared with highly-optimized state-of-the-art approaches.
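A minimal sketch of inter-domain alignment with a single discriminator, assuming a gradient-reversal layer applied to encoder tokens that have already passed through the hybrid attention module (omitted here); names and sizes are assumptions.

```python
# Sketch: a single domain discriminator over pooled detector tokens,
# trained adversarially via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DomainDiscriminator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, tokens, lam=1.0):
        # tokens: (B, N, dim) encoder features after the attention module
        pooled = GradReverse.apply(tokens, lam).mean(dim=1)
        return self.net(pooled)                      # one domain logit per image

disc = DomainDiscriminator()
src_logit = disc(torch.randn(2, 100, 256))
adv_loss = nn.functional.binary_cross_entropy_with_logits(
    src_logit, torch.zeros_like(src_logit))          # source label = 0, target = 1
```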
Abstract:Automated defect inspection is critical for effective and efficient maintenance, repair, and operations in advanced manufacturing. However, automated defect inspection is often constrained by the lack of defect samples, especially when we adopt deep neural networks for this task. This paper presents Defect-GAN, an automated defect synthesis network that generates realistic and diverse defect samples for training accurate and robust defect inspection networks. Defect-GAN learns through defacement and restoration processes, where the defacement generates defects on normal surface images while the restoration removes defects to generate normal images. It employs a novel compositional layer-based architecture for generating realistic defects within various image backgrounds with different textures and appearances. It can also mimic the stochastic variations of defects and offer flexible control over the locations and categories of the generated defects within the image background. Extensive experiments show that Defect-GAN is capable of synthesizing various defects with superior diversity and fidelity. In addition, the synthesized defect samples demonstrate their effectiveness in training better defect inspection networks.
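A minimal sketch of the compositional defacement idea: the generator predicts a defect foreground and a soft spatial mask that is composited over the normal background. The toy backbone and layer sizes are assumptions rather than Defect-GAN's architecture.

```python
# Sketch: composite a predicted defect layer onto a normal surface image via a soft mask,
# which naturally controls where defects appear.
import torch
import torch.nn as nn

class CompositionalDefacer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1))            # 3 defect channels + 1 mask channel

    def forward(self, normal_image):
        out = self.backbone(normal_image)
        defect = torch.tanh(out[:, :3])                # defect foreground layer
        mask = torch.sigmoid(out[:, 3:])               # soft spatial mask
        # composite: defects appear only where the mask is active
        return mask * defect + (1 - mask) * normal_image, mask

defacer = CompositionalDefacer()
defaced, mask = defacer(torch.rand(1, 3, 128, 128))
```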