School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel
Abstract: Nowadays, many captured images are "observed" only by machines and not by humans, for example, by the cameras of robots or autonomous cars. High-level machine vision models, such as object recognition, assume images are transformed to some canonical image space by the camera ISP. However, the camera ISP is optimized for producing images that are visually pleasing to human observers, not for machines; thus, one may spare the ISP compute time and apply the vision models directly to the RAW data. Yet, it has been shown that training such models directly on the RAW images results in a performance drop. To mitigate this drop in performance (without the need to annotate RAW data), we use a dataset of RAW and RGB image pairs, which can be easily acquired with no human labeling. We then train a model that is applied directly to the RAW data by using knowledge distillation such that the model predictions for RAW images will be aligned with the predictions of an off-the-shelf pre-trained model for processed RGB images. Our experiments show that our performance on RAW images is significantly better than that of a model trained on labeled RAW images. It also reasonably matches the predictions of a pre-trained model on processed RGB images, while saving the ISP compute overhead.
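The alignment between RAW-image predictions and RGB-image predictions described in the abstract above can be illustrated with a small sketch. The abstract does not state which distillation loss is used; the KL divergence between teacher and student class distributions below is one common choice and is an assumption here, as are the function names, the `temperature` parameter, and the example logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) between class distributions.

    Assumed setup: the teacher (a pre-trained RGB model) sees the processed
    RGB image, the student sees the paired RAW image; no labels are needed.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )

# Hypothetical logits for one RAW/RGB image pair.
teacher_on_rgb = [2.0, 0.5, -1.0]
student_on_raw = [1.2, 0.9, -0.3]
loss = distillation_loss(student_on_raw, teacher_on_rgb)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, so minimizing it over many RAW/RGB pairs pulls the RAW model toward the RGB model's predictions.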
Abstract: While the GAN is a powerful model for generating images, its inability to infer a latent space directly limits its use in applications requiring an encoder. Our paper presents a simple architectural setup that combines the generative capabilities of a GAN with an encoder. We accomplish this by combining the encoder with the discriminator using shared weights, then training them simultaneously using a new loss term. We model the output of the encoder latent space via a GMM, which leads to both good clustering using this latent space and improved image generation by the GAN. Our framework is generic and can be easily plugged into any GAN strategy. In particular, we demonstrate it both with Vanilla GAN and Wasserstein GAN, where in both it leads to an improvement in the generated images in terms of both the IS and FID scores. Moreover, we show that our encoder learns a meaningful representation as its clustering results are competitive with the current GAN-based state-of-the-art in clustering.
Abstract: Few-shot learning methods offer pre-training techniques optimized for easier later adaptation of the model to new classes (unseen during training) using one or a few examples. This adaptivity to unseen classes is especially important for many practical applications where the pre-trained label space cannot remain fixed for effective use and the model needs to be "specialized" to support new categories on the fly. One particularly interesting scenario, essentially overlooked by the few-shot literature, is Coarse-to-Fine Few-Shot (C2FS), where the training classes (e.g., animals) are of much "coarser granularity" than the target (test) classes (e.g., breeds). A very practical example of C2FS is when the target classes are sub-classes of the training classes. Intuitively, it is especially challenging as (both regular and few-shot) supervised pre-training tends to learn to ignore intra-class variability, which is essential for separating sub-classes. In this paper, we introduce a novel "Angular normalization" module that makes it possible to effectively combine supervised and self-supervised contrastive pre-training to approach the proposed C2FS task, demonstrating significant gains in a broad study over multiple baselines and datasets. We hope that this work will help to pave the way for future research on this new, challenging, and very practical topic of C2FS classification.
Abstract: We propose new, robust loss functions for the point cloud registration problem. Our loss functions are inspired by the Best Buddies Similarity (BBS) measure that counts the number of mutual nearest neighbors between two point sets. This measure has been shown to be robust to outliers and missing data in the case of template matching for images. We present several algorithms, collectively named Best Buddy Registration (BBR), where each algorithm consists of optimizing one of these loss functions using the Adam optimizer. The loss functions differ in several ways, including the distance function used (point-to-point vs. point-to-plane), and how the BBS measure is combined with the actual distances between pairs of points. Experiments on various data sets, both synthetic and real, demonstrate the effectiveness of the BBR algorithms, showing that they are quite robust to noise, outliers, and distractors, and cope well with extremely sparse point clouds. One variant, BBR-F, achieves state-of-the-art accuracy in the registration of automotive lidar scans taken up to several seconds apart, from the KITTI and Apollo-Southbay datasets.
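The mutual-nearest-neighbor count behind the BBS measure mentioned in the abstract above can be sketched as follows. This is a brute-force illustration of the counting step only, not the differentiable BBR losses themselves; the function names and the toy point sets are hypothetical.

```python
import math

def nearest_index(point, candidates):
    """Index of the candidate closest to `point` (Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda j: math.dist(point, candidates[j]))

def best_buddy_pairs(P, Q):
    """Return (i, j) pairs where P[i] and Q[j] are each other's nearest neighbor."""
    nn_pq = [nearest_index(p, Q) for p in P]  # nearest point in Q for each P[i]
    nn_qp = [nearest_index(q, P) for q in Q]  # nearest point in P for each Q[j]
    return [(i, j) for i, j in enumerate(nn_pq) if nn_qp[j] == i]

# Toy example: the last point of P is an outlier with no counterpart in Q.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
Q = [(0.1, 0.0, 0.0), (1.2, 0.0, 0.0)]
pairs = best_buddy_pairs(P, Q)
bbs_count = len(pairs)
```

In this toy case the outlier's nearest neighbor in `Q` already has a closer partner in `P`, so the outlier forms no best-buddy pair and does not contribute to the count, which gives an intuition for the robustness the abstract refers to.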
Abstract: A major factor in the success of deep neural networks is the use of sophisticated architectures rather than the classical multilayer perceptron (MLP). Residual networks (ResNets) stand out among these powerful modern architectures. Previous works focused on the optimization advantages of deep ResNets over deep MLPs. In this paper, we show another distinction between the two models, namely, a tendency of ResNets to promote smoother interpolations than MLPs. We analyze this phenomenon via the neural tangent kernel (NTK) approach. First, we compute the NTK for a considered ResNet model and prove its stability during gradient descent training. Then, we show by various evaluation methodologies that the NTK of the ResNet, and its kernel regression results, are smoother than those of the MLP. The better smoothness observed in our analysis may explain the better generalization ability of ResNets and the practice of moderately attenuating the residual blocks.
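For readers unfamiliar with the objects named in the abstract above, the standard definitions can be written out as below. The abstract does not give the exact ResNet model analyzed, so the residual-block form and the symbols $\alpha$, $g_\ell$, $X$, and $y$ are assumed notation for illustration only.

```latex
% NTK of a scalar-output network f(x; \theta) with parameters \theta
\Theta(x, x') = \left\langle \nabla_\theta f(x;\theta),\; \nabla_\theta f(x';\theta) \right\rangle

% Kernel regression with the NTK on training inputs X and targets y
\hat{f}(x) = \Theta(x, X)\, \Theta(X, X)^{-1}\, y

% Residual block with an attenuation factor \alpha (assumed form)
x_{\ell+1} = x_\ell + \alpha\, g_\ell(x_\ell)
```

Under this notation, "moderately attenuating the residual blocks" corresponds to choosing $\alpha < 1$, and the smoothness comparison concerns $\Theta$ and $\hat{f}$ for the two architectures.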
Abstract: Light field photography enables recording 4D images that contain angular information alongside spatial information of the scene. One of the important applications of light field imaging is post-capture refocusing. Current methods require a dense field of angle views for this purpose; these can be acquired with a micro-lens system or with a compressive system. Both techniques have major drawbacks to consider, including bulky structures and an angular-spatial resolution trade-off. We present a novel implementation of digital refocusing based on sparse angular information using neural networks. This allows recording at high spatial resolution at the expense of angular resolution, thus enabling the design of compact and simple devices with improved hardware as well as better performance of compressive systems. We use a novel convolutional neural network whose relatively small structure enables fast reconstruction with low memory consumption. Moreover, it handles various refocusing ranges and noise levels without re-training. Results show major improvement compared to existing methods.
Abstract: In this paper, we introduce a deep learning technique for consolidation and sharp feature generation of point clouds using only the input point cloud itself. Rather than explicitly defining a prior that describes typical shape characteristics (i.e., piecewise-smoothness), or a heuristic policy for generating novel sharp points, we opt to learn both using a neural network with shared weights. Instead of relying on a large collection of manually annotated data, we use the self-supervision present within a single shape, i.e., self-prior, to train the network, and learn the underlying distribution of sharp features specific to the given input point cloud. By learning to map a low-curvature subset of the input point cloud to a disjoint high-curvature subset, the network formalizes the shape-specific characteristics and infers how to generate sharp points. During test time, the network is repeatedly fed a random subset of points from the input and displaces them to generate an arbitrarily large set of novel sharp feature points. The local shared weights are optimized over the entire shape, learning non-local statistics and exploiting the recurrence of local-scale geometries. We demonstrate the ability to generate coherent sets of sharp feature points on a variety of shapes, while eliminating outliers and noise.
Abstract: Neural Architecture Search (NAS) has been used recently to achieve improved performance in various tasks, most prominently in image classification. Yet, current search strategies rely on large labeled datasets, which limits their usage in the case where only a smaller fraction of the data is annotated. Self-supervised learning has shown great promise in training neural networks using unlabeled data. In this work, we propose a self-supervised neural architecture search (SSNAS) that allows finding novel network models without the need for labeled data. We show that such a search leads to results comparable to supervised training with a "fully labeled" NAS and that it can improve the performance of self-supervised learning. Moreover, we demonstrate the advantage of the proposed approach when the number of labels in the search is relatively small.
Abstract: Recently, deep generative adversarial networks for image generation have advanced rapidly; yet, only a small amount of research has focused on generative models for irregular structures, particularly meshes. Nonetheless, mesh generation and synthesis remain a fundamental topic in computer graphics. In this work, we propose a novel framework for synthesizing geometric textures. It learns geometric texture statistics from local neighborhoods (i.e., local triangular patches) of a single reference 3D model. It learns deep features on the faces of the input triangulation, which are used to subdivide and generate offsets across multiple scales, without parameterization of the reference or target mesh. Our network displaces mesh vertices in any direction (i.e., in the normal and tangential directions), enabling synthesis of geometric textures that cannot be expressed by a simple 2D displacement map. Learning and synthesizing on local geometric patches enables a genus-oblivious framework, facilitating texture transfer between shapes of different genus.
Abstract: Face verification is a fast-growing authentication tool for everyday systems, such as smartphones. While current 2D face recognition methods are very accurate, it has been suggested recently that one may wish to add a 3D sensor to such solutions to make them more reliable and robust to spoofing, e.g., using a 2D print of a person's face. Yet, this requires an additional, relatively expensive depth sensor. To mitigate this, we propose a novel authentication system based on slim grayscale coded light field imaging. We provide a fast, reconstruction-free anti-spoofing mechanism that works directly on the coded image. It is followed by a multi-view, multi-modal face verification network that, given grayscale data together with a low-res depth map, achieves results competitive with the RGB case. We demonstrate the effectiveness of our solution on a simulated 3D (RGBD) version of LFW, which will be made public, and a set of real faces acquired by a light field computational camera.