Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, varying camera view angles, which confuse feature discrimination across vehicles of different poses, remain challenging for Vehicle Re-ID models in the real world. To improve Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in a target pose: the idea is to project vehicles of diverse poses into a unified target pose so as to enhance feature discrimination. Considering that paired data of the same vehicles captured by different traffic surveillance cameras might not be available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named VehicleGAN, which works in both supervised and unsupervised settings without knowledge of geometric 3D models. Because of the feature distribution gap between real and synthetic data, simply training a traditional metric-learning-based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory; we therefore propose a new Joint Metric Learning (JML) approach via effective feature-level fusion of real and synthetic data. Extensive experiments on the public VeRi-776 and VehicleID datasets demonstrate the accuracy and effectiveness of the proposed VehicleGAN and JML.
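As an illustration of feature-level fusion for joint metric learning, the following PyTorch sketch fuses the features of a real image and its pose-normalized synthetic counterpart and trains them with a triplet loss plus an ID classification loss; the shared backbone, the concatenation-based fusion, and all hyperparameters are assumptions for illustration only, not the exact JML design.

```python
# A minimal PyTorch sketch of feature-level fusion for joint metric learning.
# The fusion operator (concatenation + linear projection), the backbone, and all
# hyperparameters are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torchvision.models as models

class JointFeatureModel(nn.Module):
    def __init__(self, embed_dim=256, num_ids=576):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                  # 512-d global features
        self.backbone = backbone
        self.fuse = nn.Linear(512 * 2, embed_dim)    # fuse real + synthetic features
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, real_img, synth_img):
        f_real = self.backbone(real_img)             # features of the real-pose image
        f_synth = self.backbone(synth_img)           # features of the pose-normalized image
        f = self.fuse(torch.cat([f_real, f_synth], dim=1))
        return f, self.classifier(f)

model = JointFeatureModel()
triplet = nn.TripletMarginLoss(margin=0.3)
ce = nn.CrossEntropyLoss()

# Dummy batch: anchor/positive/negative pairs of (real, synthetic) images and ID labels.
a_r, a_s = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
p_r, p_s = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
n_r, n_s = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 576, (8,))

f_a, logits = model(a_r, a_s)
f_p, _ = model(p_r, p_s)
f_n, _ = model(n_r, n_s)
loss = triplet(f_a, f_p, f_n) + ce(logits, labels)   # metric + identity supervision
loss.backward()
```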
Low-light enhancement has gained increasing importance with the rapid development of visual creation and editing. However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose the Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework that provides users with rich controllability. Built upon a conditional diffusion model, CLE Diffusion introduces an illumination embedding that lets users control their desired brightness level. Additionally, we incorporate the Segment Anything Model (SAM) to enable user-friendly region controllability, where users can click on objects to specify the regions they wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves competitive performance in quantitative metrics, qualitative results, and versatile controllability. Project page: https://yuyangyin.github.io/CLEDiffusion/
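To make the conditioning mechanism concrete, the sketch below shows one plausible way to inject a scalar brightness level through an illumination embedding, alongside a SAM-style region mask, into a toy denoising network; the architecture and the way the mask is concatenated are simplifying assumptions, not the actual CLE Diffusion implementation.

```python
# A minimal PyTorch sketch of brightness-level conditioning for a denoising network.
# The embedding design, mask handling, and network are simplified assumptions; the
# actual CLE Diffusion architecture may differ.
import torch
import torch.nn as nn

class IlluminationEmbedding(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, brightness):              # brightness in [0, 1], shape (B, 1)
        return self.mlp(brightness)

class ConditionedDenoiser(nn.Module):
    """Toy denoiser: noisy image + region mask as input, brightness embedding as bias."""
    def __init__(self, dim=128):
        super().__init__()
        self.illum = IlluminationEmbedding(dim)
        self.inp = nn.Conv2d(3 + 1, dim, 3, padding=1)   # image channels + region mask
        self.mid = nn.Conv2d(dim, dim, 3, padding=1)
        self.out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, noisy, mask, brightness):
        h = self.inp(torch.cat([noisy, mask], dim=1))
        h = h + self.illum(brightness)[:, :, None, None]  # inject illumination condition
        return self.out(torch.relu(self.mid(torch.relu(h))))

net = ConditionedDenoiser()
x = torch.randn(2, 3, 64, 64)                   # noisy image at some diffusion step
m = torch.zeros(2, 1, 64, 64)
m[:, :, 16:48, 16:48] = 1.0                     # user-clicked region (stand-in for a SAM mask)
eps_hat = net(x, m, torch.tensor([[0.3], [0.9]]))   # two different target brightness levels
```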
This work addresses the challenging domain adaptation setting in which knowledge from the labelled source domain dataset is available only through a pretrained black-box segmentation model. The pretrained model's predictions for the target domain images are noisy because of the distributional differences between the source domain data and the target domain data. Since the model's predictions serve as pseudo labels during self-training, the noise in the predictions imposes an upper bound on model performance. Therefore, we propose a simple yet novel image translation workflow, ReGEN, to address this problem. ReGEN comprises an image-to-image translation network and a segmentation network. Our workflow generates target-like images using the noisy predictions from the original target domain images. These target-like images are semantically consistent with the noisy model predictions and can therefore be used to train the segmentation network. In addition to being semantically consistent with the predictions from the original target domain images, the generated target-like images are also stylistically similar to the target domain images. This allows us to leverage the stylistic differences between the target-like images and the original target domain images as an additional source of supervision while training the segmentation model. We evaluate our model on two benchmark domain adaptation settings and demonstrate that our approach performs favourably relative to recent state-of-the-art work. The source code will be made available.
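The sketch below illustrates this workflow at a high level: a translation network maps the target image and its noisy pseudo label to a target-like image, which then supervises a segmentation network with the same pseudo label; the tiny networks, the one-hot label conditioning, and the L1 stand-in for the stylistic supervision are all assumptions made for illustration.

```python
# A schematic PyTorch sketch of the workflow described above: a translation network
# maps (target image, noisy pseudo label) to a target-like image, and a segmentation
# network is trained on that image against the same pseudo label. Both networks and
# the style term are simplified placeholders, not the ReGEN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 19

class TinyTranslator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + NUM_CLASSES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
    def forward(self, img, label_onehot):
        return self.net(torch.cat([img, label_onehot], dim=1))

class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, NUM_CLASSES, 3, padding=1))
    def forward(self, img):
        return self.net(img)

translator, segnet = TinyTranslator(), TinySegNet()
target_img = torch.rand(2, 3, 128, 128)
noisy_pseudo = torch.randint(0, NUM_CLASSES, (2, 128, 128))   # black-box model predictions
onehot = F.one_hot(noisy_pseudo, NUM_CLASSES).permute(0, 3, 1, 2).float()

target_like = translator(target_img, onehot)                  # semantically consistent image
seg_loss = F.cross_entropy(segnet(target_like), noisy_pseudo)
style_loss = F.l1_loss(target_like, target_img)               # crude stand-in for style supervision
(seg_loss + 0.1 * style_loss).backward()
```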
In this paper, we focus on the under-explored issue of biased activation in prior weakly-supervised object localization methods based on Class Activation Mapping (CAM). We analyze the cause of this problem from a causal view and attribute it to co-occurring background confounders. Following this insight, we propose a novel Counterfactual Co-occurring Learning (CCL) paradigm that synthesizes counterfactual representations by coupling the constant foreground with unrealized backgrounds, in order to cut off their co-occurring relationship. Specifically, we design a new network structure called Counterfactual-CAM, which embeds a counterfactual representation perturbation mechanism into the vanilla CAM-based model. This mechanism is responsible for decoupling the foreground from the background and synthesizing the counterfactual representations. By training the detection model with these synthesized representations, we compel the model to focus on the constant foreground content while minimizing the influence of distracting co-occurring backgrounds. To the best of our knowledge, this is the first attempt in this direction. Extensive experiments on several benchmarks demonstrate that Counterfactual-CAM successfully mitigates the biased activation problem, achieving improved object localization accuracy.
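As a concrete illustration of counterfactual composition, the sketch below splits pooled foreground and background features with a thresholded CAM, pairs each foreground with the background of another image in the batch, and keeps the foreground label; the thresholding and composition rules are illustrative assumptions rather than the exact Counterfactual-CAM mechanism.

```python
# A minimal PyTorch sketch of composing counterfactual features: the foreground
# feature of one image is paired with the background feature of another, and the
# classifier is still asked to predict the foreground class.
import torch
import torch.nn.functional as F

def split_fg_bg(feat, cam, thresh=0.5):
    """Split a feature map (B,C,H,W) into pooled foreground/background vectors via a CAM (B,1,H,W)."""
    cam = (cam - cam.amin(dim=(2, 3), keepdim=True)) / (
        cam.amax(dim=(2, 3), keepdim=True) - cam.amin(dim=(2, 3), keepdim=True) + 1e-6)
    fg_mask = (cam >= thresh).float()
    bg_mask = 1.0 - fg_mask
    fg = (feat * fg_mask).sum((2, 3)) / (fg_mask.sum((2, 3)) + 1e-6)
    bg = (feat * bg_mask).sum((2, 3)) / (bg_mask.sum((2, 3)) + 1e-6)
    return fg, bg

B, C, H, W = 4, 256, 14, 14
feat = torch.randn(B, C, H, W)                   # backbone features
cam = torch.rand(B, 1, H, W)                     # class activation map
labels = torch.randint(0, 10, (B,))
classifier = torch.nn.Linear(C, 10)

fg, bg = split_fg_bg(feat, cam)
perm = torch.randperm(B)                         # "unrealized" backgrounds from other images
counterfactual = fg + bg[perm]                   # constant foreground + swapped background
loss = F.cross_entropy(classifier(counterfactual), labels)   # label follows the foreground
loss.backward()
```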
Virtual try-on of eyeglasses involves placing eyeglasses of different shapes and styles onto a face image without physically trying them on. While existing methods have shown impressive results, the variety of eyeglasses styles is limited and the interactions are not always intuitive or efficient. To address these limitations, we propose a Text-guided Eyeglasses Manipulation method that allows for control of the eyeglasses shape and style based on a binary mask and text, respectively. Specifically, we introduce a mask encoder to extract mask conditions and a modulation module that enables simultaneous injection of text and mask conditions. This design allows for fine-grained control of the eyeglasses' appearance based on both textual descriptions and spatial constraints. Our approach includes a disentangled mapper and a decoupling strategy that preserves irrelevant areas, resulting in better local editing. We employ a two-stage training scheme to handle the different convergence speeds of the various modality conditions, successfully controlling both the shape and style of eyeglasses. Extensive comparison experiments and ablation analyses demonstrate the effectiveness of our approach in achieving diverse eyeglasses styles while preserving irrelevant areas.
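For intuition, the sketch below implements a FiLM-style modulation that injects a mask embedding and a text embedding into an intermediate feature map through learned scale and shift; the dimensions, the mask encoder, and the use of a generic text embedding (e.g., from CLIP) are assumptions, not the paper's exact modulation module.

```python
# A minimal PyTorch sketch of a modulation module that injects text and mask
# conditions into an image feature map via learned scale/shift (FiLM-style).
# This is only a simplified illustration with assumed dimensions.
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
    def forward(self, mask):                     # binary eyeglasses-shape mask (B,1,H,W)
        return self.net(mask)

class TextMaskModulation(nn.Module):
    def __init__(self, feat_dim=128, text_dim=512, mask_dim=64):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim + mask_dim, feat_dim * 2)
    def forward(self, feat, text_emb, mask_emb): # feat: (B, feat_dim, H, W)
        scale, shift = self.to_scale_shift(
            torch.cat([text_emb, mask_emb], dim=1)).chunk(2, dim=1)
        return feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

mask_enc, mod = MaskEncoder(), TextMaskModulation()
feat = torch.randn(2, 128, 32, 32)               # intermediate generator features
text_emb = torch.randn(2, 512)                   # stand-in for a text embedding (assumed)
mask_emb = mask_enc(torch.rand(2, 1, 64, 64).round())
out = mod(feat, text_emb, mask_emb)              # features modulated by both conditions
```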
In the field of Image-to-Image (I2I) translation, ensuring consistency between input images and their translated results is a key requirement for producing high-quality and desirable outputs. Previous I2I methods have relied on result consistency, which enforces consistency between the translated results and the ground-truth outputs, to achieve this goal. However, result consistency is limited in its ability to handle complex and unseen attribute changes in translation tasks. To address this issue, we introduce a transition-aware approach to I2I translation, in which the data translation mapping is explicitly parameterized with a transition variable, allowing for the modelling of unobserved translations triggered by unseen transitions. Furthermore, we propose transition consistency, defined on the transition variable, to regularize consistency on unobserved translations, which previous works omit. Based on these insights, we present the Unseen Transition Suss GAN (UTSGAN), a generative framework that constructs a manifold for the transition with a stochastic transition encoder and coherently regularizes and generalizes result consistency and transition consistency on both training and unobserved translations with specially designed constraints. Extensive experiments on four different I2I tasks performed on five different datasets demonstrate the efficacy of the proposed UTSGAN in performing consistent translations.
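One simple way to realize a transition-consistency constraint is sketched below: a generator is conditioned on a sampled transition variable, and a transition encoder must recover that variable from the input/output pair; the networks and the MSE form of the constraint are assumptions for illustration and do not reproduce UTSGAN's precise formulation.

```python
# A schematic PyTorch sketch of transition consistency: the translated result is
# encoded back into the transition space, and the recovered transition must match
# the sampled one. Networks and loss form are simplified assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_DIM = 16

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3 + T_DIM, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
    def forward(self, x, t):                     # broadcast the transition over space
        t_map = t[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, t_map], dim=1))

class TransitionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, T_DIM))
    def forward(self, x, y):                     # infer the transition between two images
        return self.net(torch.cat([x, y], dim=1))

G, E = Generator(), TransitionEncoder()
x = torch.rand(2, 3, 64, 64)
t = torch.randn(2, T_DIM)                        # sampled (possibly unseen) transition
y = G(x, t)
transition_consistency = F.mse_loss(E(x, y), t)  # recovered transition must match the sample
transition_consistency.backward()
```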
Obtaining sufficient labelled data for model training is impractical for most real-life applications. Therefore, we address the problem of domain generalization for semantic segmentation tasks to reduce the need to acquire and label additional data. Recent work on domain generalization increases data diversity by varying domain-variant features such as colour, style and texture in images. However, excessive or even uniform stylization may reduce performance. The performance reduction is especially pronounced for pixels from minority classes, which are already more challenging to classify than pixels from majority classes. Therefore, we introduce a module, $ASH_{+}$, that modulates the stylization strength for each pixel depending on the pixel's semantic content. In this work, we also introduce a parameter that balances the element-wise and channel-wise proportions of stylized features relative to the original source domain features in the stylized source domain images. This learned parameter replaces an empirically determined global hyperparameter, allowing for more fine-grained control over the output stylized image. We conduct multiple experiments to validate the effectiveness of our proposed method. Finally, we evaluate our model on the publicly available benchmark semantic segmentation datasets (Cityscapes and SYNTHIA). Quantitative and qualitative comparisons indicate that our approach is competitive with the state of the art. Code is made available at \url{https://github.com/placeholder}
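The sketch below shows one way such a module could blend stylized and original features: an AdaIN-style stylization is gated by a per-pixel map computed from the features and a per-channel parameter, with a single learned scalar balancing the two; this is a simplified illustration under assumed dimensions, not the actual $ASH_{+}$ implementation.

```python
# A minimal PyTorch sketch of blending stylized and original features with a
# learned balance between element-wise and channel-wise mixing. The semantic-aware
# gating and AdaIN-style stylization here are simplifying assumptions.
import torch
import torch.nn as nn

def adain(content, style, eps=1e-5):
    """Re-normalize content features to the channel statistics of the style features."""
    c_mean, c_std = content.mean((2, 3), keepdim=True), content.std((2, 3), keepdim=True) + eps
    s_mean, s_std = style.mean((2, 3), keepdim=True), style.std((2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

class SemanticAwareBlend(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.pixel_gate = nn.Conv2d(channels, 1, 1)                       # per-pixel strength
        self.channel_gate = nn.Parameter(torch.zeros(1, channels, 1, 1))  # per-channel strength
        self.balance = nn.Parameter(torch.tensor(0.5))                    # element/channel balance

    def forward(self, feat, style_feat):
        stylized = adain(feat, style_feat)
        a_pix = torch.sigmoid(self.pixel_gate(feat))     # depends on the pixel's content
        a_ch = torch.sigmoid(self.channel_gate)
        b = torch.sigmoid(self.balance)
        alpha = b * a_pix + (1 - b) * a_ch               # combined stylization strength
        return alpha * stylized + (1 - alpha) * feat

blend = SemanticAwareBlend()
feat = torch.randn(2, 64, 32, 32)          # source-domain features
style = torch.randn(2, 64, 32, 32)         # features providing the target style statistics
out = blend(feat, style)
```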
By introducing a new operator theory, we provide a unified mathematical theory for general source resolution in the multi-illumination imaging problem. Our main idea is to transform multi-illumination imaging into single-snapshot imaging with a new imaging kernel that depends on both the illumination patterns and the point spread function of the imaging system. We thus prove that the resolution of multi-illumination imaging is approximately determined by the essential cutoff frequency of the new imaging kernel, which is roughly limited by the sum of the cutoff frequency of the point spread function and the maximum essential frequency in the illumination patterns. Our theory provides a unified way to estimate the resolution of various existing super-resolution modalities and yields the same estimates as those obtained in experiments. In addition, based on this reformulation of the multi-illumination imaging problem, we also estimate the resolution limits for resolving both complex and positive sources by sparsity-based approaches. We show that the resolution of multi-illumination imaging is approximately determined by the new imaging kernel from our operator theory, and that better resolution can be realized in practice by sparsity-promoting techniques, but only for very sparse sources. This explains experimentally observed phenomena in some sparsity-based super-resolution modalities.
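Schematically, and with notation introduced here only for illustration (source $\rho$, illumination patterns $I_k$, point spread function $h$, noise $w_k$, and essential cutoff frequencies $\Omega_h$ and $\Omega_I$), the main estimate can be summarized as
\begin{align*}
y_k &= (I_k\,\rho) * h + w_k, \quad k = 1, \ldots, K, && \text{(multi-illumination measurements)}\\
\Omega_{\mathrm{new}} &\lesssim \Omega_h + \Omega_I, && \text{(essential cutoff of the new imaging kernel)}\\
\text{resolution} &\sim \frac{\pi}{\Omega_{\mathrm{new}}}, && \text{(up to a constant factor of order one).}
\end{align*}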
In this paper, we analyze the capacity of super-resolution of one-dimensional positive sources. In particular, we consider the same setting as in [arXiv:1904.09186v2 [math.NA]] and generalize the results there to the case of super-resolving positive sources. To be more specific, we consider resolving $d$ positive point sources with $p \leqslant d$ nodes closely spaced and forming a cluster, while the rest of the nodes are well separated. Similarly to [arXiv:1904.09186v2 [math.NA]], our results show that when the noise level $\epsilon \lesssim \mathrm{SRF}^{-2 p+1}$, where $\mathrm{SRF}=(\Omega \Delta)^{-1}$ with $\Omega$ being the cutoff frequency and $\Delta$ the minimal separation between the nodes, the minimax error rate for reconstructing the cluster nodes is of order $\frac{1}{\Omega} \mathrm{SRF}^{2 p-2} \epsilon$, while for recovering the corresponding amplitudes $\left\{a_j\right\}$ the rate is of order $\mathrm{SRF}^{2 p-1} \epsilon$. For the non-cluster nodes, the corresponding minimax rates for the recovery of nodes and amplitudes are of order $\frac{\epsilon}{\Omega}$ and $\epsilon$, respectively. Our numerical experiments show that the Matrix Pencil method achieves the above optimal bounds when resolving the positive sources.
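For reference, the band-limited measurement model assumed in this setting (consistent with the cited work) and the stated rates, with $\Delta$ the minimal separation within the cluster of $p$ nodes, can be summarized as
\begin{gather*}
y(\omega) = \sum_{j=1}^{d} a_j e^{i x_j \omega} + w(\omega), \qquad a_j > 0, \quad |\omega| \leqslant \Omega, \quad |w(\omega)| \leqslant \epsilon, \qquad \mathrm{SRF} = (\Omega \Delta)^{-1},\\
\epsilon \lesssim \mathrm{SRF}^{-2p+1}: \quad
\text{cluster nodes} \sim \frac{1}{\Omega}\,\mathrm{SRF}^{2p-2}\,\epsilon, \quad
\text{cluster amplitudes} \sim \mathrm{SRF}^{2p-1}\,\epsilon, \quad
\text{non-cluster nodes, amplitudes} \sim \frac{\epsilon}{\Omega},\ \epsilon.
\end{gather*}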