Self-supervised learning (SSL) has made enormous progress and largely narrowed the gap with supervised learning, where representation learning is mainly guided by a projection of features into an embedding space. During this projection, current methods simply aggregate pixels uniformly for the embedding; however, this risks involving object-irrelevant nuisances and spatial misalignment across different augmentations. In this paper, we present a new approach, Learning Where to Learn (LEWEL), which adaptively aggregates the spatial information of features so that the projected embeddings can be exactly aligned and thus better guide feature learning. Concretely, we reinterpret the projection head in SSL as a per-pixel projection and predict a set of spatial alignment maps from the original features with this weight-sharing projection head. A spectrum of aligned embeddings is then obtained by aggregating the features with spatial weighting according to these alignment maps. As a result of this adaptive alignment, we observe substantial improvements on both image-level prediction and dense prediction at the same time: LEWEL improves MoCov2 by 1.6%/1.3%/0.5%/0.4% points and BYOL by 1.3%/1.3%/0.7%/0.6% points on ImageNet linear/semi-supervised classification, Pascal VOC semantic segmentation, and object detection, respectively.
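A minimal sketch (not the authors' code) of the aligned-aggregation idea: the shared projection head is applied per-pixel, as a 1x1 convolution, to predict spatial alignment maps that then weight the feature aggregation. The number of maps and how they are read off the head's output are our assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignedAggregation(nn.Module):
        def __init__(self, in_dim=2048, embed_dim=256, num_maps=4):
            super().__init__()
            # Weight-sharing head: usable as the usual projector (on pooled
            # features) and per-pixel (as a 1x1 conv) to get alignment maps.
            self.head = nn.Conv2d(in_dim, embed_dim, kernel_size=1)
            self.num_maps = num_maps  # hypothetical number of alignment maps

        def forward(self, feat):              # feat: (B, C, H, W) backbone features
            logits = self.head(feat)          # per-pixel projection: (B, D, H, W)
            # Take the first num_maps channels as alignment logits (an
            # assumption; the paper derives the maps from the shared head).
            maps = logits[:, :self.num_maps].flatten(2)          # (B, K, H*W)
            maps = F.softmax(maps, dim=-1)                       # normalize over space
            feat_flat = feat.flatten(2)                          # (B, C, H*W)
            # Spatially weighted aggregation of the original features.
            aligned = torch.einsum('bkn,bcn->bkc', maps, feat_flat)  # (B, K, C)
            # Project each aligned feature with the same shared head weights.
            w = self.head.weight.squeeze(-1).squeeze(-1)         # (D, C)
            return aligned @ w.t() + self.head.bias              # (B, K, D)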
Graph Neural Networks (GNNs) are deep learning models that take graph data as input, and they are applied to various tasks such as traffic prediction and molecular property prediction. However, owing to the complexity of GNNs, it has been difficult to analyze which parts of the input affect a GNN model's outputs. In this study, we extend explainability methods developed for Convolutional Neural Networks (CNNs), such as Local Interpretable Model-Agnostic Explanations (LIME), Gradient-Based Saliency Maps, and Gradient-Weighted Class Activation Mapping (Grad-CAM), to GNNs, and predict which edges in the input graphs are important for GNN decisions. The experimental results indicate that the LIME-based approach is the most efficient explainability method across multiple tasks in real-world situations, outperforming even the state-of-the-art method in GNN explainability.
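As a rough illustration of one of the transferred methods (gradient-based saliency, not the paper's exact implementation), one can make a dense adjacency matrix differentiable and read edge importance from the gradient magnitude on each edge entry; the toy GCN below is our own stand-in model.

    import torch
    import torch.nn as nn

    class TinyGCN(nn.Module):
        def __init__(self, in_dim, hid_dim, out_dim):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hid_dim)
            self.w2 = nn.Linear(hid_dim, out_dim)

        def forward(self, x, adj):
            h = torch.relu(adj @ self.w1(x))   # one propagation step
            h = adj @ self.w2(h)               # second propagation step
            return h.mean(dim=0)               # graph-level readout (mean pooling)

    def edge_saliency(model, x, adj, target_class):
        adj = adj.clone().requires_grad_(True)
        score = model(x, adj)[target_class]
        score.backward()
        # Importance of edge (i, j): gradient magnitude on A_ij, masked to
        # the edges that actually exist in the input graph.
        return (adj.grad.abs() * (adj.detach() > 0).float()).detach()

    # Toy usage on a random 5-node graph with 8 features per node.
    x = torch.randn(5, 8)
    adj = (torch.rand(5, 5) > 0.5).float()
    model = TinyGCN(8, 16, 3)
    print(edge_saliency(model, x, adj, target_class=0))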
Unsupervised video-based person re-identification (re-ID) methods extract richer features from video tracklets than image-based ones. State-of-the-art methods utilize clustering to obtain pseudo-labels and train the models iteratively. However, they underestimate the influence of two kinds of frames in a tracklet: 1) noise frames, caused by detection errors or heavy occlusions, which may be assigned unreliable labels during clustering; and 2) hard frames, caused by pose changes or partial occlusions, which are difficult to distinguish but informative. This paper proposes a Noise and Hard frame Aware Clustering (NHAC) method. NHAC consists of a graph trimming module and a node re-sampling module. The graph trimming module obtains stable graphs by removing noise frame nodes, improving clustering accuracy. The node re-sampling module enhances the training of hard frame nodes to learn rich tracklet information. Experiments conducted on two video-based datasets demonstrate the effectiveness of the proposed NHAC under the unsupervised re-ID setting.
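A hypothetical sketch of the two ideas as we read the abstract (not the authors' code): trim likely-noise frame nodes before clustering, and re-sample hard frames, those far from the tracklet centroid but kept after trimming, more often during training. Thresholds and the distance-based hardness proxy are our assumptions.

    import numpy as np

    def trim_graph(feats, k=5, noise_quantile=0.1):
        """Drop frames whose mean similarity to their k nearest neighbors is
        in the lowest noise_quantile (treated as noise nodes)."""
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = feats @ feats.T
        np.fill_diagonal(sim, -np.inf)
        knn_sim = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # mean top-k similarity
        keep = knn_sim >= np.quantile(knn_sim, noise_quantile)
        return np.where(keep)[0]

    def resample_weights(feats, centroid, temperature=1.0):
        """Give hard frames (far from the tracklet centroid) higher
        sampling probability."""
        d = np.linalg.norm(feats - centroid, axis=1)
        w = np.exp((d - d.max()) / temperature)  # stabilized softmax weighting
        return w / w.sum()

    feats = np.random.randn(20, 128).astype(np.float32)
    kept = trim_graph(feats)
    probs = resample_weights(feats[kept], feats[kept].mean(axis=0))
    batch = np.random.choice(kept, size=8, p=probs)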
Shadow removal is an essential task in computer vision and computer graphics. Recent shadow removal approaches all train convolutional neural networks (CNNs) on real paired shadow/shadow-free or shadow/shadow-free/mask image datasets. However, obtaining a large-scale, diverse, and accurate dataset has been a major challenge, and this limits the performance of the learned models on shadow images with unseen shapes and intensities. To overcome this challenge, we present SynShadow, a novel large-scale dataset of synthetic shadow/shadow-free/matte image triplets, together with a pipeline to synthesize it. We extend a physically-grounded shadow illumination model and synthesize a shadow image given an arbitrary combination of a shadow-free image, a matte image, and shadow attenuation parameters. Owing to the diversity, quantity, and quality of SynShadow, we demonstrate that shadow removal models trained on it perform well at removing shadows with diverse shapes and intensities on several challenging benchmarks. Furthermore, we show that merely fine-tuning from a SynShadow-pre-trained model improves existing shadow detection and removal models. Code is publicly available at https://github.com/naoto0804/SynShadow.
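An illustrative, deliberately simplified version of the synthesis step: darken the shadow-free image where the matte is active, using per-channel attenuation factors. The paper's physically-grounded illumination model is richer; this sketch only shows the compositing direction of the pipeline.

    import numpy as np

    def synthesize_shadow(shadow_free, matte, attenuation):
        """shadow_free: (H, W, 3) float in [0, 1]
        matte:       (H, W)   float in [0, 1], 1 = fully shadowed
        attenuation: (3,)     per-channel darkening factor in [0, 1]"""
        m = matte[..., None]                  # (H, W, 1) for broadcasting
        shadowed = shadow_free * attenuation  # fully attenuated version
        return (1.0 - m) * shadow_free + m * shadowed

    img = np.random.rand(64, 64, 3)
    matte = np.zeros((64, 64)); matte[16:48, 16:48] = 1.0  # square shadow region
    shadow_img = synthesize_shadow(img, matte, np.array([0.4, 0.45, 0.5]))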
With the expansion of the video advertising market, research on predicting the effects of video advertising is attracting more attention. Although effect prediction for image advertising has been explored extensively, prediction for video advertising remains challenging and little studied. In this research, we propose a method for predicting the click-through rate (CTR) of video advertisements and analyzing the factors that determine the CTR. We demonstrate an optimized framework for accurately predicting these effects by taking advantage of the multimodal nature of online video advertisements, including video, text, and metadata features. In particular, the two types of metadata, categorical and continuous, are properly separated and normalized. To avoid overfitting, which is crucial in our task because the training data are not very rich, additional regularization layers are inserted. Experimental results show that our approach achieves a correlation coefficient as high as 0.695, a significant improvement over the baseline (0.487).
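A hypothetical sketch of the kind of fusion model the abstract describes: pre-extracted video and text features are concatenated with embedded categorical metadata and normalized continuous metadata, with dropout as the extra regularization. All dimensions and layer sizes here are invented for illustration.

    import torch
    import torch.nn as nn

    class CTRModel(nn.Module):
        def __init__(self, video_dim=512, text_dim=256, n_categories=100,
                     cat_emb_dim=16, cont_dim=8):
            super().__init__()
            self.cat_emb = nn.Embedding(n_categories, cat_emb_dim)
            self.cont_norm = nn.BatchNorm1d(cont_dim)  # normalize continuous metadata
            fused = video_dim + text_dim + cat_emb_dim + cont_dim
            self.mlp = nn.Sequential(
                nn.Linear(fused, 128), nn.ReLU(),
                nn.Dropout(0.5),                       # regularization (small dataset)
                nn.Linear(128, 1),
            )

        def forward(self, video, text, cat_ids, cont):
            x = torch.cat([video, text, self.cat_emb(cat_ids),
                           self.cont_norm(cont)], dim=1)
            return self.mlp(x).squeeze(1)              # predicted CTR (regression)

    model = CTRModel()
    ctr = model(torch.randn(4, 512), torch.randn(4, 256),
                torch.randint(0, 100, (4,)), torch.randn(4, 8))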
In recent years, deep neural networks (DNNs) have achieved accuracy equivalent to or even higher than that of humans in various recognition tasks. However, some images lead DNNs to a completely wrong decision, whereas humans never fail with them. Among others, fooling images are images that are not recognizable as natural objects such as dogs and cats, yet DNNs classify them into specific classes with high confidence scores. In this paper, we propose a new class of fooling images, sparse fooling images (SFIs), which are single-color images with a small number of altered pixels. Unlike existing fooling images, which retain some characteristic features of natural objects, SFIs have no local or global features recognizable to humans; nevertheless, in machine perception (i.e., by DNN classifiers), SFIs are recognized as natural objects and classified into certain classes with high confidence scores. We propose two methods to generate SFIs for different settings (semi-black-box and white-box). We also experimentally demonstrate the vulnerability of DNNs through out-of-distribution detection and compare three architectures in terms of their robustness against SFIs. This study raises questions about the structure and robustness of CNNs and discusses the differences between human and machine perception.
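A white-box sketch, our illustration rather than the paper's exact algorithm, of how such an image could be generated: start from a single-color image and iteratively keep only the k pixel perturbations with the largest magnitude while maximizing the classifier's confidence for a chosen target class. The projection scheme and hyperparameters are assumptions.

    import torch
    import torch.nn.functional as F

    def make_sfi(model, target, shape=(1, 3, 32, 32), k=20, steps=100, lr=0.1):
        base = torch.full(shape, 0.5)             # single-color (gray) base image
        delta = torch.zeros(shape, requires_grad=True)
        for _ in range(steps):
            logits = model(base + delta)
            loss = -F.log_softmax(logits, dim=1)[0, target]
            loss.backward()
            with torch.no_grad():
                delta -= lr * delta.grad
                # Sparsity projection: keep the k spatial positions with the
                # largest perturbation magnitude, zero out the rest.
                mag = delta.abs().sum(dim=1).flatten()
                thresh = mag.topk(k).values.min()
                mask = (delta.abs().sum(dim=1, keepdim=True) >= thresh).float()
                delta *= mask
                delta.clamp_(-0.5, 0.5)            # keep the image within [0, 1]
            delta.grad.zero_()
        return (base + delta).detach()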
In this paper, we present a novel image inpainting technique using frequency-domain information. Prior works on image inpainting predict the missing pixels by training neural networks using only spatial-domain information. However, these methods still struggle to reconstruct high-frequency details for real complex scenes, leading to discrepancies in color, boundary artifacts, distorted patterns, and blurry textures. To alleviate these problems, we investigate whether better performance can be obtained by training the networks with frequency-domain information (via the Discrete Fourier Transform) alongside the spatial-domain information. To this end, we propose a frequency-based deconvolution module that enables the network to learn the global context while selectively reconstructing the high-frequency components. We evaluate our proposed method on the publicly available CelebA, Paris Streetview, and DTD texture datasets, and show that it outperforms current state-of-the-art image inpainting techniques both qualitatively and quantitatively.
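A rough sketch, with our own assumptions about the design, of mixing frequency-domain information into such a network: features are transformed with a 2D DFT, filtered by learned per-frequency weights (which act globally on the image), and transformed back, alongside the usual spatial convolution. The paper's deconvolution module is more elaborate than this.

    import torch
    import torch.nn as nn

    class FrequencyBlock(nn.Module):
        def __init__(self, channels, h, w):
            super().__init__()
            self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
            # Learned per-frequency filter (real + imaginary parts).
            self.freq_weight = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2))

        def forward(self, x):
            spec = torch.fft.rfft2(x, norm='ortho')   # (B, C, H, W//2+1), complex
            w = torch.view_as_complex(self.freq_weight)
            spec = spec * w                           # global, frequency-wise filtering
            freq_feat = torch.fft.irfft2(spec, s=x.shape[-2:], norm='ortho')
            return self.spatial(x) + freq_feat        # fuse spatial + frequency paths

    block = FrequencyBlock(16, 64, 64)
    out = block(torch.randn(2, 16, 64, 64))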
Re-identification (re-ID) is currently investigated as a closed-world image retrieval task and evaluated with retrieval-based metrics. The algorithms return ranking lists to users but cannot tell which images are the true targets. In essence, current re-ID overemphasizes the importance of retrieval and underemphasizes that of verification, \textit{i.e.}, all returned images are considered to be the target. On the other hand, re-ID should also cover the scenario in which the query identity does not appear in the gallery. To this end, we go back to the essence of re-ID, \textit{i.e.}, a combination of retrieval and verification in an open-set setting, and put forward a new metric, the Genuine Open-set re-ID Metric (GOM). GOM explicitly balances the effects of retrieval and verification in a single unified metric. It can also be decomposed into a family of sub-metrics, enabling a clear analysis of re-ID performance. We evaluate the effectiveness of GOM on re-ID benchmarks, showing its ability to capture important aspects of re-ID performance that have not been taken into account by established metrics so far. Furthermore, we show that GOM aligns well with human visual evaluation of re-ID performance. Code is available at https://github.com/YuanXinCherry/Person-reID-Evaluation.
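The exact definition of GOM is given in the paper and repository; the sketch below only illustrates the idea it formalizes: score retrieval (ranking quality) and verification (thresholded accept/reject, including queries absent from the gallery) separately for each query, then combine them into one number. All function names and the mixing weight are our assumptions.

    import numpy as np

    def retrieval_score(sims, gallery_ids, query_id):
        """Precision@1-style retrieval term for one query."""
        return float(gallery_ids[np.argmax(sims)] == query_id)

    def verification_score(sims, gallery_ids, query_id, thresh):
        """Accept iff the best match clears the threshold; a correct accept
        requires a matching identity, a correct reject requires the query
        to be absent from the gallery (the open-set case)."""
        accepted = sims.max() >= thresh
        present = query_id in gallery_ids
        if accepted:
            return float(present and gallery_ids[np.argmax(sims)] == query_id)
        return float(not present)

    def combined_score(sims, gallery_ids, query_id, thresh, alpha=0.5):
        r = retrieval_score(sims, gallery_ids, query_id)
        v = verification_score(sims, gallery_ids, query_id, thresh)
        return alpha * r + (1 - alpha) * v   # single unified number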
Pretext tasks and contrastive learning have been successful in self-supervised learning for video retrieval and recognition. In this study, we analyze their optimization targets and use the hyper-sphere feature space to explore the connections between them, indicating the compatibility and consistency of these two different learning methods. Based on this analysis, we propose a self-supervised training method, referred to as Pretext-Contrastive Learning (PCL), to learn video representations. Extensive experiments based on different combinations of pretext-task baselines and contrastive losses confirm the strong agreement between their self-supervised learning targets, demonstrating the effectiveness and generality of PCL. The combination of pretext tasks and contrastive losses yields significant improvements in both video retrieval and recognition over the corresponding baselines, and in the same manner we also outperform current state-of-the-art methods. Furthermore, PCL is flexible and can be applied to almost all existing pretext-task methods.
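A minimal sketch of the joint training signal as we read the abstract: a pretext-task loss (here, clip playback-speed classification as a stand-in, since the specific task is pluggable) and a standard contrastive loss (InfoNCE) are optimized together on the same backbone. The module names and the weighting are our assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, tau=0.07):
        """Contrastive loss between two augmented views, each (B, D)."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / tau            # (B, B) similarity matrix
        labels = torch.arange(z1.size(0))     # positives on the diagonal
        return F.cross_entropy(logits, labels)

    def pcl_loss(backbone, pretext_head, proj_head,
                 clips_v1, clips_v2, speed_labels, lam=1.0):
        h1, h2 = backbone(clips_v1), backbone(clips_v2)
        loss_pretext = F.cross_entropy(pretext_head(h1), speed_labels)
        loss_contrast = info_nce(proj_head(h1), proj_head(h2))
        return loss_pretext + lam * loss_contrast   # joint objective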
In this study, we address image retargeting, the task of adjusting input images to arbitrary sizes. In one of the best-performing methods, called MULTIOP, multiple retargeting operators are combined, and retargeted images are generated at each stage to find the optimal sequence of operators that minimizes the distance between the original and retargeted images. The limitation of this method lies in its tremendous processing time, which severely prohibits practical use. The purpose of this study is therefore to find the optimal combination of operators within a reasonable processing time; we propose predicting the optimal operator at each step with a reinforcement learning agent. The technical contributions of this study are as follows. First, we propose a reward based on self-play, which is insensitive to the large variance of the content-dependent distance measured in MULTIOP. Second, we propose dynamically changing the loss weight for each action to prevent the algorithm from falling into a local optimum and from choosing only the most frequently used operator during training. Our experiments show that we achieve multi-operator image retargeting with processing time reduced by three orders of magnitude and the same quality as the original multi-operator-based method, which was the best-performing algorithm for retargeting tasks.
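A schematic sketch of the two training ideas as we read the abstract (not the authors' code): a self-play reward that compares the agent's distance against a frozen copy of its own policy instead of using the raw, content-dependent distance, and per-action loss weights that decay for frequently chosen operators. The exact reward values and weighting scheme are assumptions.

    import numpy as np

    def self_play_reward(agent_distance, opponent_distance):
        """+1 / -1 / 0 depending on whether the current policy beats a frozen
        copy of itself; insensitive to the scale of the distance itself."""
        if agent_distance < opponent_distance:
            return 1.0
        if agent_distance > opponent_distance:
            return -1.0
        return 0.0

    def action_loss_weights(action_counts, smoothing=1.0):
        """Down-weight frequently used operators so training does not
        collapse onto a single operator (a local optimum)."""
        counts = np.asarray(action_counts, dtype=np.float64) + smoothing
        w = 1.0 / counts
        return w / w.sum() * len(counts)   # normalized so the mean weight is 1

    print(action_loss_weights([120, 30, 5]))   # rare operators get larger weights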