From generating never-before-seen images to domain adaptation, applications of Generative Adversarial Networks (GANs) span a wide range of vision and graphics problems. Because GANs are remarkably good at learning the distribution of a particular class and generating images from it, they can be used for semi-supervised classification tasks. However, if two classes of images share similar characteristics, the GAN might learn to generalize across them, which hinders classification of the two classes. In this paper, we use various images from the MNIST and Fashion-MNIST datasets to illustrate how similar images cause the GAN to generalize, leading to poor classification of images. We propose a modification to the traditional training of GANs that improves multi-class classification of similar classes of images in a semi-supervised learning framework.
Camera traps enable the automatic collection of large quantities of image data. Ecologists use camera traps to monitor animal populations all over the world. In order to estimate the abundance of a species from camera trap data, ecologists need to know not just which species were seen, but also how many individuals of each species were seen. Object detection techniques can be used to find the number of individuals in each image. However, since camera traps collect images in motion-triggered bursts, simply adding up the number of detections over all frames is likely to lead to an incorrect estimate. Overcoming this obstacle may require incorporating spatio-temporal reasoning or individual re-identification in addition to traditional species detection and classification. We have prepared a challenge where the training data and test data are from different cameras spread across the globe. The sets of species seen by the cameras overlap but are not identical. The challenge is to classify species and count individual animals across sequences in the test cameras.
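The counting problem described above can be illustrated with a small sketch. It assumes detections arrive as hypothetical `(sequence_id, frame_index, species)` tuples and uses the maximum per-frame count within a burst as the estimate. This is one simple baseline chosen here for illustration, not a method the challenge prescribes, and it undercounts when animals pass the camera one at a time.

```python
from collections import defaultdict

def count_individuals(detections):
    """Estimate individuals per (sequence, species) as the max count in any frame.

    detections: iterable of (sequence_id, frame_index, species) tuples,
    one tuple per detected animal (a hypothetical input format).
    """
    per_frame = defaultdict(int)
    for seq_id, frame_idx, species in detections:
        per_frame[(seq_id, species, frame_idx)] += 1

    per_sequence = defaultdict(int)
    for (seq_id, species, _frame), n in per_frame.items():
        key = (seq_id, species)
        per_sequence[key] = max(per_sequence[key], n)
    return dict(per_sequence)

# Two deer seen in frames 0 and 2 of one burst, one of them also in frame 1.
detections = [
    ("seq1", 0, "deer"), ("seq1", 0, "deer"),
    ("seq1", 1, "deer"),
    ("seq1", 2, "deer"), ("seq1", 2, "deer"),
    ("seq2", 0, "fox"),
]
counts = count_individuals(detections)
naive_total = len(detections)  # summing every detection counts the deer 5 times
```

Summing detections over all frames would report five deer in `seq1`, while the per-sequence maximum reports two.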
Robust road segmentation is a key challenge in self-driving research. Though many image-based methods have been studied and high performance in dataset evaluations has been reported, developing robust and reliable road segmentation is still a major challenge. Data fusion across different sensors to improve the performance of road segmentation is widely considered an important and irreplaceable solution. In this paper, we propose a novel structure to fuse images and LiDAR point clouds in an end-to-end semantic segmentation network, in which the fusion is performed at the decoder stage rather than, as is more common, at the encoder stage. During fusion, we improve the multi-scale LiDAR map generation by introducing a pyramid projection method, which increases the precision of the multi-scale LiDAR maps. Additionally, we adapt the multi-path refinement network to our fusion strategy and improve road prediction compared with transposed convolution with skip layers. Our approach has been tested on the KITTI ROAD dataset and achieves competitive performance.
Reducing traffic fatalities and serious injuries is a top priority of the US Department of Transportation. Computer vision (CV)-based crash anticipation in the near-crash phase is receiving growing attention. The ability to perceive fatal crash risks earlier is also critical because it will improve the reliability of crash anticipation. Yet annotated image data for training a reliable AI model for the early visual perception of crash risks are not abundant. The Fatality Analysis Reporting System contains a large volume of data on fatal crashes. It is a reliable data source for learning the relationship between driving scene characteristics and fatal crashes, which can compensate for the limitation of CV. Therefore, this paper develops a data analytics model, named scenario-wise, spatio-temporal attention guidance, from fatal crash report data, which can estimate the relevance of detected objects to fatal crashes from their environment and context information. First, the paper identifies five sparse variables that allow for decomposing the 5-year fatal crash dataset to develop scenario-wise attention guidance. Then, exploratory analysis of location- and time-related variables of the crash report data suggests reducing fatal crashes to spatially defined groups. A group's temporal pattern is an indicator of the similarity of fatal crashes in the group. Hierarchical clustering and K-means clustering merge the spatially defined groups into six clusters according to the similarity of their temporal patterns. After that, association rule mining discovers, for each cluster, the statistical relationship between the temporal information of driving scenes and crash features. The paper shows how the developed attention guidance supports the design and implementation of a preliminary CV model that can identify objects that may be involved in fatal crashes from their environment and context information.
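As a rough illustration of the association rule mining step mentioned above, the sketch below computes the standard support, confidence, and lift of a single rule. The item labels (`"night"`, `"pedestrian"`, and so on) and the tiny transaction list are invented for illustration and are not taken from the crash report data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets of item labels (hypothetical crash records).
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)

    support = n_both / n                               # P(A and B)
    confidence = n_both / n_ante if n_ante else 0.0    # P(B | A)
    lift = confidence / (n_cons / n) if n_cons else 0.0  # P(B | A) / P(B)
    return support, confidence, lift

# Invented records: each set mixes temporal information with a crash feature.
records = [
    {"night", "rural", "pedestrian"},
    {"night", "rural"},
    {"day", "urban"},
    {"night", "urban", "pedestrian"},
]
support, confidence, lift = rule_metrics(records, {"night"}, {"pedestrian"})
```

A lift above 1 indicates that the consequent occurs more often with the antecedent than it does overall.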
Transformers with powerful global relation modeling abilities have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture to image classification by simply splitting images into tokens with a fixed length and employing transformers to learn relations between these tokens. However, such naive tokenization could destroy object structures, assign grids to regions of no interest such as background, and introduce interference signals. To mitigate these issues, in this paper we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive sampling is differentiable. When combined with the Vision Transformer, the obtained PS-ViT network can adaptively learn where to look. The proposed PS-ViT is both effective and efficient. When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT with about $4\times$ fewer parameters and $10\times$ fewer FLOPs. Code is available at https://github.com/yuexy/PS-ViT.
Averaging predictions over a set of models -- an ensemble -- is widely used to improve predictive performance and uncertainty estimation of deep learning models. At the same time, many machine learning systems, such as search, matching, and recommendation systems, rely heavily on embeddings. Unfortunately, because the features of independently trained models are misaligned, embeddings cannot be improved with a naive deep-ensemble-like approach. In this work, we look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA), a simple yet well-performing recipe for ensembling representations. Empirically, we demonstrate that MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models. Even more exciting, we draw connections between MeTTA, image retrieval, and transformation-invariant models. We believe that extending the success of ensembles to the inference of higher-quality representations is an important step that will open many new applications of ensembling.
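The idea of averaging embeddings over test-time augmentations can be sketched as follows. The abstract does not spell out the exact recipe, so this is only a minimal sketch: `toy_embed` is a hypothetical stand-in for a trained encoder (a fixed random linear projection), and the augmentation set (identity plus horizontal flip) is an assumption.

```python
import numpy as np

def mean_embedding(image, embed_fn, augmentations):
    """Average the embeddings of several augmented views of one image."""
    embeddings = np.stack([embed_fn(aug(image)) for aug in augmentations])
    return embeddings.mean(axis=0)

rng = np.random.default_rng(0)
projection = rng.normal(size=(8, 16))  # hypothetical encoder weights

def toy_embed(image):
    # Stand-in for a real network: flatten the image and project it linearly.
    return projection @ image.reshape(-1)

# Assumed augmentation set: identity and horizontal flip.
augmentations = [lambda x: x, lambda x: x[:, ::-1]]

image = rng.normal(size=(4, 4))
emb = mean_embedding(image, toy_embed, augmentations)
emb_flipped = mean_embedding(image[:, ::-1], toy_embed, augmentations)
```

Because this augmentation set is closed under flipping, the averaged embedding of an image and of its mirror coincide, which hints at the connection to transformation-invariant models mentioned in the abstract.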
Acquisition of Synthetic Aperture Sonar (SAS) datasets is bottlenecked by the costly deployment of SAS imaging systems, and even when data acquisition is possible, the data is often skewed towards containing barren seafloor rather than objects of interest. We present a novel pipeline, called SAS GAN, which couples an optical renderer with a generative adversarial network (GAN) to synthesize realistic SAS images of targets on the seafloor. This coupling yields high levels of SAS image realism while retaining control over image geometry and parameters. We demonstrate qualitative results by presenting examples of images created with our pipeline. We also present quantitative results through the use of t-SNE and the Fr\'echet Inception Distance to argue that our generated SAS imagery potentially augments SAS datasets more effectively than an off-the-shelf GAN.
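For readers unfamiliar with the metric named above, the Fréchet distance between two Gaussians fitted to feature sets is $\|\mu_a - \mu_b\|^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$. The sketch below evaluates that formula on arbitrary feature arrays. In the actual Fréchet Inception Distance the features come from an Inception network, which is omitted here; the random features are placeholders only.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))  # placeholder for real-image features
shifted_feats = real_feats + 3.0        # same covariance, mean shifted by 3 per dim

d_same = frechet_distance(real_feats, real_feats)
d_shifted = frechet_distance(real_feats, shifted_feats)
```

With identical covariances the trace term vanishes, so `d_shifted` reduces to the squared distance between the means.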
Cycle consistency is widely used for face editing. However, we observe that the generator tends to satisfy the cycle-consistency constraint through a shortcut, hiding information from the original image, which makes it impossible to maintain the rich details (e.g., wrinkles and moles) of non-editing areas. In this work, we propose a simple yet effective method named HifaFace to address this problem from two perspectives. First, we relieve the pressure on the generator to synthesize rich details by directly feeding the high-frequency information of the input image into the end of the generator. Second, we adopt an additional discriminator to encourage the generator to synthesize rich details. Specifically, we apply a wavelet transformation to transform the image into multi-frequency domains, among which the high-frequency parts can be used to recover the rich details. We also notice that fine-grained and wider-range control of the attribute is of great importance for face editing. To achieve this goal, we propose a novel attribute regression loss. Powered by the proposed framework, we achieve high-fidelity and arbitrary face editing, outperforming other state-of-the-art approaches.
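To make the role of the high-frequency wavelet bands concrete, here is a minimal single-level 2D Haar decomposition and its inverse. The choice of the Haar wavelet and the band names (`rows`, `cols`, `diag`) are assumptions made for illustration; the abstract only says that a wavelet transformation is applied.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an even-sized image."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    low = (a + b + c + d) / 2.0        # local averages (low frequency)
    rows = (a + b - c - d) / 2.0       # differences between adjacent rows
    cols = (a - b + c - d) / 2.0       # differences between adjacent columns
    diag = (a - b - c + d) / 2.0       # diagonal differences
    return low, (rows, cols, diag)

def haar_idwt2(low, high):
    """Invert haar_dwt2 exactly."""
    rows, cols, diag = high
    out = np.empty((2 * low.shape[0], 2 * low.shape[1]), dtype=float)
    out[0::2, 0::2] = (low + rows + cols + diag) / 2.0
    out[0::2, 1::2] = (low + rows - cols - diag) / 2.0
    out[1::2, 0::2] = (low - rows + cols - diag) / 2.0
    out[1::2, 1::2] = (low - rows - cols + diag) / 2.0
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
low, high = haar_dwt2(image)
restored = haar_idwt2(low, high)

# Dropping the high-frequency bands keeps only 2x2 block averages.
zeros = tuple(np.zeros_like(band) for band in high)
blurred = haar_idwt2(low, zeros)
```

Reconstruction from the low band alone loses fine detail, while adding the high-frequency bands back restores the image exactly, which is the intuition behind feeding those bands to the end of the generator.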
The use of convolutional neural networks (ConvNets) in medical imaging research has become widespread in recent years. However, a major drawback of these methods is that they require a large number of annotated training images. Data augmentation has been proposed to alleviate this. One data augmentation strategy is to apply random deformation to existing image data, but the deformed images often will not exhibit realistic shape or intensity patterns. In this paper, we present a novel ConvNet-based image registration method for creating patient-like digital phantoms from existing computerized phantoms. Unlike existing learning-based registration techniques, for which the performance predominantly depends on the domain-specific training images, the proposed method is fully unsupervised, meaning that it optimizes an objective function independently of training data for a given image pair. While classical registration methods also do not require training data, they work in a lower-dimensional parameter space; the proposed approach operates directly in the high-dimensional parameter space without any training beforehand. We show that the resulting deformed phantom competently matches the anatomy model of a real human while providing the "gold standard" for the anatomies. Combined with simulation programs, the generated phantoms could potentially serve as a data augmentation tool in today's deep learning studies.
This brief sketches initial progress towards a unified energy-based solution for the semi-supervised visual anomaly detection and localization problem. In this setup, we have access to only anomaly-free training data and want to detect and identify anomalies of an arbitrary nature on test data. We employ the density estimates from the energy-based model (EBM) as normalcy scores that can be used to discriminate normal images from anomalous ones. Further, we back-propagate the gradients of the energy score with respect to the image in order to generate a gradient map that provides pixel-level spatial localization of the anomalies in the image. In addition to the spatial localization, we show that simple processing of the gradient map can also provide alternative normalcy scores that either match or surpass the detection performance obtained with the energy value. To quantitatively validate the performance of the proposed method, we conduct experiments on the MVTec industrial dataset. Though still preliminary, our results are very promising and reveal the potential of EBMs for simultaneously detecting and localizing unforeseen anomalies in images.
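The gradient-map idea above can be illustrated with a toy example. The sketch replaces the learned energy-based model with a per-pixel Gaussian energy fitted to synthetic "normal" images, and replaces back-propagation with the closed-form gradient of that energy. Both substitutions, and all names and values below, are assumptions made to keep the example self-contained.

```python
import numpy as np

def energy(image, mean, std):
    """Toy energy: low for images close to the per-pixel training statistics."""
    return 0.5 * np.sum(((image - mean) / std) ** 2)

def energy_gradient_map(image, mean, std):
    """Magnitude of dE/d(image); large values mark pixels that raise the energy."""
    grad = (image - mean) / std ** 2  # closed-form gradient of the toy energy
    return np.abs(grad)

rng = np.random.default_rng(0)

# Synthetic anomaly-free training set used to fit the toy energy.
train = rng.normal(loc=0.5, scale=0.05, size=(200, 8, 8))
mean, std = train.mean(axis=0), train.std(axis=0) + 1e-6

normal = rng.normal(loc=0.5, scale=0.05, size=(8, 8))
anomalous = normal.copy()
anomalous[2, 5] += 1.0  # inject a single-pixel defect

grad_map = energy_gradient_map(anomalous, mean, std)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(grad_map), grad_map.shape))
```

The energy value separates the anomalous image from the normal one, and the peak of the gradient map falls on the injected defect, mirroring the detection and localization roles described in the abstract.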