We present a framework for translating unlabeled images from one domain into analogous images in another domain. We employ a progressively growing skip-connected encoder-generator structure and train it with a GAN loss for realistic output, a cycle consistency loss for maintaining identity under same-domain translation, and a semantic consistency loss that encourages the network to preserve the input's semantic features in the output. We apply our framework to the task of translating face images and show that it is capable of learning semantic mappings for face images without any supervised one-to-one image mapping.
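As a minimal PyTorch sketch of how such a three-term objective can be assembled (the tiny networks, the A-to-B-and-back form of the cycle loss, and the loss weights are illustrative assumptions, not the paper's progressively growing architecture or exact losses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_net(out_ch=3):
    # Stand-in for an encoder-generator; the real model is far larger and grown progressively.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, out_ch, 3, padding=1))

G_ab, G_ba = tiny_net(), tiny_net()            # A -> B and B -> A generators
D_b = nn.Conv2d(3, 1, 3, padding=1)            # toy patch discriminator for domain B
feat = nn.Conv2d(3, 8, 3, padding=1)           # stand-in semantic feature extractor
for p in feat.parameters():
    p.requires_grad_(False)                    # typically pretrained and kept fixed

x_a = torch.randn(4, 3, 64, 64)                # unlabeled batch from domain A
fake_b = G_ab(x_a)                             # translate A -> B
rec_a = G_ba(fake_b)                           # translate back to A

pred = D_b(fake_b)
loss_gan = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))  # realism
loss_cyc = F.l1_loss(rec_a, x_a)               # cycle consistency (A -> B -> A)
loss_sem = F.l1_loss(feat(fake_b), feat(x_a))  # semantic consistency

loss_g = loss_gan + 10.0 * loss_cyc + 1.0 * loss_sem   # weights are placeholders
loss_g.backward()
```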
Modeling the distribution of natural images is challenging, partly because of strong statistical dependencies which can extend over hundreds of pixels. Recurrent neural networks have been successful in capturing long-range dependencies in a number of problems but only recently have found their way into generative image models. We here introduce a recurrent image model based on multi-dimensional long short-term memory units which are particularly suited for image modeling due to their spatial structure. Our model scales to images of arbitrary size and its likelihood is computationally tractable. We find that it outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.
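For intuition, the toy sketch below implements a simplified multi-dimensional LSTM cell in PyTorch, where each pixel receives hidden and cell states from its left and top neighbors; the paper's actual spatial units and factorized output distribution differ in detail, so this is only an assumed, minimal variant:

```python
import torch
import torch.nn as nn

class MDLSTMCell(nn.Module):
    """Toy 2D LSTM cell: each position is conditioned on its left and top neighbors."""
    def __init__(self, in_dim, hid):
        super().__init__()
        # Gates: input, forget-left, forget-top, output, candidate.
        self.gates = nn.Linear(in_dim + 2 * hid, 5 * hid)

    def forward(self, x, h_left, c_left, h_top, c_top):
        z = self.gates(torch.cat([x, h_left, h_top], dim=-1))
        i, fl, ft, o, g = z.chunk(5, dim=-1)
        i, fl, ft, o = map(torch.sigmoid, (i, fl, ft, o))
        c = fl * c_left + ft * c_top + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

# Raster scan over a toy 8x8 grayscale image, batch of 2.
B, H, W, hid = 2, 8, 8, 16
cell = MDLSTMCell(1, hid)
img = torch.rand(B, H, W, 1)
h = [[None] * W for _ in range(H)]
c = [[None] * W for _ in range(H)]
zero = torch.zeros(B, hid)
for i in range(H):
    for j in range(W):
        h_left, c_left = (h[i][j - 1], c[i][j - 1]) if j > 0 else (zero, zero)
        h_top, c_top = (h[i - 1][j], c[i - 1][j]) if i > 0 else (zero, zero)
        h[i][j], c[i][j] = cell(img[:, i, j], h_left, c_left, h_top, c_top)
# h[i][j] summarizes the pixels above and to the left of (i, j) and could feed a
# small head that parameterizes the predictive distribution of the next pixel.
```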
Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to provide feedback for training deaf speech. Unlike existing models that translate between speech and text or between text and images, we target an immediate, low-level translation that applies to generic environmental sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features since they can easily be parsed by humans to match and distinguish sounds, words, and speakers.
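A simplified sketch of the shared-latent idea, with assumed dimensions and plain MLP encoder/decoder stand-ins (the actual AudioViewer model additionally uses the unpaired-translation and disentanglement objectives mentioned above, which are not shown here):

```python
import torch
import torch.nn as nn

LATENT = 32  # assumed size of the common latent space

# Audio encoder and image decoder meet in the shared latent space.
audio_enc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, LATENT))
image_dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                          nn.Linear(256, 3 * 32 * 32), nn.Sigmoid())

mfcc = torch.randn(16, 128)                  # toy per-frame audio features
z = audio_enc(mfcc)                          # shared latent codes
frames = image_dec(z).view(16, 3, 32, 32)    # one video frame per audio frame
```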
Domain diversities, including inconsistent annotations and varied image collection conditions, inevitably exist among facial expression recognition (FER) datasets, posing an evident challenge for adapting an FER model trained on one dataset to another. Recent works mainly focus on domain-invariant deep feature learning with adversarial learning mechanisms, ignoring the sibling facial action unit (AU) detection task, which has made great progress. Considering that AUs objectively determine facial expressions, this paper proposes an AU-guided unsupervised Domain Adaptive FER (AdaFER) framework. In AdaFER, we first leverage an advanced model for AU detection on both the source and target domains. Then, we compare the AU results to perform AU-guided annotating, i.e., target faces that share the same AUs as source faces inherit the labels from the source domain. Meanwhile, to achieve domain-invariant compact features, we utilize AU-guided triplet training, which randomly collects anchor-positive-negative triplets on both domains according to their AUs. We conduct extensive experiments on several popular benchmarks and show that AdaFER achieves state-of-the-art results on all of them.
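The AU-guided triplet term can be sketched as follows in PyTorch; the sampling rule (matching full binary AU patterns), the backbone features, and the margin are assumptions for illustration rather than the AdaFER implementation:

```python
import torch
import torch.nn.functional as F

def au_guided_triplets(feats_src, aus_src, feats_tgt, aus_tgt):
    """Pair each source anchor with a target positive sharing its AU pattern and a
    target negative with a different pattern (hypothetical sampling rule)."""
    anchors, positives, negatives = [], [], []
    for i in range(feats_src.size(0)):
        same = (aus_tgt == aus_src[i]).all(dim=1)
        if same.any() and (~same).any():
            anchors.append(feats_src[i])
            positives.append(feats_tgt[same.nonzero()[0, 0]])
            negatives.append(feats_tgt[(~same).nonzero()[0, 0]])
    if not anchors:
        return torch.zeros((), requires_grad=True)
    return F.triplet_margin_loss(torch.stack(anchors),
                                 torch.stack(positives),
                                 torch.stack(negatives), margin=0.5)

# Toy example: 128-d face features with 5 binary AU labels per face.
feats_src, feats_tgt = torch.randn(8, 128), torch.randn(8, 128)
aus_src = torch.randint(0, 2, (8, 5))
aus_tgt = torch.randint(0, 2, (8, 5))
loss = au_guided_triplets(feats_src, aus_src, feats_tgt, aus_tgt)
```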
In recent years, quaternion matrix completion (QMC) based on low-rank regularization has been gradually applied to image denoising and deblurring. Unlike low-rank matrix completion (LRMC), which handles RGB images by recovering each color channel separately, QMC models exploit the coupling of the three channels by processing them as a whole. Most existing quaternion-based methods formulate low-rank QMC (LRQMC) as a quaternion nuclear norm (a convex relaxation of the rank) minimization problem. The main limitation of these approaches is that all singular values are minimized simultaneously, so the low-rank property cannot be approximated well or efficiently. To achieve a more accurate low-rank approximation, the matrix-based truncated nuclear norm has been proposed and shown to be superior. In this paper, we introduce a quaternion truncated nuclear norm (QTNN) for LRQMC and utilize the alternating direction method of multipliers (ADMM) to solve the resulting optimization problem. We further propose weighting the residual error quaternion matrix during the update process to accelerate the convergence of the QTNN method while maintaining admissible performance. The weighted method utilizes a concise gradient descent strategy with a theoretical guarantee in optimization. The effectiveness of our method is illustrated by experiments on real visual datasets.
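For reference, the matrix-based truncated nuclear norm with truncation parameter r penalizes only the smallest singular values, leaving the r dominant ones free, and the completion problem constrains the recovered matrix to match the observations on the index set Omega; the quaternion version uses the analogous quantity defined via the quaternion SVD, and the notation below is a sketch that may differ from the paper's:

```latex
\|\mathbf{X}\|_{r} = \sum_{i=r+1}^{\min(m,n)} \sigma_i(\mathbf{X}),
\qquad
\min_{\mathbf{X}} \ \|\mathbf{X}\|_{r}
\quad \text{s.t.} \quad
P_{\Omega}(\mathbf{X}) = P_{\Omega}(\mathbf{M}),
```

where the singular values \sigma_i(\mathbf{X}) are sorted in descending order, \mathbf{M} holds the observed entries, and P_{\Omega} keeps the entries in \Omega and zeros out the rest.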
Cattle activity is an essential index for monitoring the health and welfare of ruminants. Changes in livestock behavior are thus a critical indicator for early detection and prevention of several diseases. Rumination behavior is a significant variable for tracking development and yield in animal husbandry. Therefore, various monitoring methods and measurement devices have been used to assess cattle behavior. However, these attached devices are invasive, stressful, and uncomfortable for the cattle and can negatively influence the welfare and diurnal behavior of the animal. Multiple research efforts have addressed the problem of rumination detection with new methods that rely on visual features. However, they use only a few postures of the dairy cow to recognize rumination or feeding behavior. In this study, we introduce an innovative monitoring method using Convolutional Neural Network (CNN)-based deep learning models. The classification is conducted under two main labels, ruminating and other, using all cow postures captured by the monitoring camera. Our proposed system is simple and easy to use, and it captures long-term dynamics through a compacted representation of a video in a single 2D image. The method proved effective in recognizing rumination behavior, achieving 95%, 98%, and 98% average accuracy, recall, and precision, respectively.
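A minimal sketch of compacting a video clip into a single 2D image that an ordinary image CNN can then classify as "ruminating" vs. "other"; the linear temporal weighting below is a simple, assumed summary scheme and not necessarily the paper's exact compaction method:

```python
import numpy as np

def compact_video(frames):
    """frames: (T, H, W) grayscale clip -> single (H, W) summary image."""
    T = frames.shape[0]
    # Linearly increasing weights emphasize later frames, so the summary encodes
    # temporal evolution (e.g., jaw motion) rather than a single static pose.
    w = 2 * np.arange(1, T + 1) - T - 1
    img = np.tensordot(w, frames.astype(np.float32), axes=(0, 0))
    # Rescale to 0..255 so the result can be fed to a standard image classifier.
    img = 255 * (img - img.min()) / (img.max() - img.min() + 1e-8)
    return img.astype(np.uint8)

clip = np.random.randint(0, 256, size=(30, 224, 224), dtype=np.uint8)
summary = compact_video(clip)   # -> (224, 224) uint8 image for the CNN
```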
This paper presents the evaluation methodology, datasets, and results of the BOP Challenge 2020, the third in a series of public competitions organized with the goal of capturing the status quo in the field of 6D object pose estimation from an RGB-D image. In 2020, to reduce the domain gap between synthetic training and real test RGB images, the participants were provided with 350K photorealistic training images generated by BlenderProc4BOP, a new open-source, light-weight, physically-based renderer (PBR) and procedural data generator. Methods based on deep neural networks have finally caught up with methods based on point pair features, which dominated previous editions of the challenge. Although the top-performing methods rely on RGB-D image channels, strong results were achieved when only RGB channels were used at both training and test time: out of 26 evaluated methods, the third-best was trained on RGB channels of PBR and real images, while the fifth-best was trained on PBR images only. Strong data augmentation was identified as a key component of the top-performing CosyPose method, and the photorealism of the PBR images was shown to be effective despite the augmentation. The online evaluation system remains open and is available at the project website: bop.felk.cvut.cz.
Existing methods for image search reranking suffer from the unfaithfulness of the assumptions under which the initial text-based image search results are obtained; as a consequence, the returned results contain many irrelevant images. Reranking therefore aims to reorder the retrieved images based on the text surrounding each image, its metadata, and its visual features. A number of methods have been proposed for this reranking. The highly ranked images are used as noisy training data, and a k-means classification algorithm is learned to further rectify the ranking. We study the sensitivity of the cross-validation procedure to this training data. The principal originality of the overall method lies in combining the text/metadata of an image with its visual features in order to achieve an automatic ranking of the images. Supervision is introduced to learn the model weights offline, prior to the reranking process. While model learning requires manual labeling of the results for a limited set of queries, the resulting model is query-independent and therefore applicable to any other query. Examples are given for a selection of classes such as vehicles and animals.
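A rough sketch of this reranking idea, assuming precomputed visual features and using scikit-learn's k-means: treat the top text-based results as noisy positives, cluster the visual features, and rerank by distance to the cluster that best matches those noisy positives. The feature choice, cluster selection, and scoring rule are illustrative assumptions, not the method's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def rerank(features, text_rank, k=5, n_noisy=50):
    """features: (N, D) visual features; text_rank: indices sorted by text score."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    # Pick the cluster most populated by the (noisy) top text-ranked images.
    top_labels = km.labels_[text_rank[:n_noisy]]
    target = np.bincount(top_labels, minlength=k).argmax()
    # Rerank all images by distance to that cluster's centroid (closer = better).
    dist = np.linalg.norm(features - km.cluster_centers_[target], axis=1)
    return np.argsort(dist)

feats = np.random.rand(200, 64)      # toy visual features
initial = np.arange(200)             # toy text-based ranking
new_order = rerank(feats, initial)   # reranked image indices
```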
Big neural networks trained on large datasets have advanced the state of the art on a wide variety of challenging problems, improving performance by a large margin. However, under low-memory and limited-compute constraints, accuracy on the same problems drops considerably. In this paper, we propose a series of techniques that significantly improve the accuracy of binarized neural networks (i.e., networks where both the features and the weights are binary). We evaluate the proposed improvements on two diverse tasks: fine-grained recognition (human pose estimation) and large-scale image recognition (ImageNet classification). Specifically, we introduce a series of novel methodological changes, including (a) more appropriate activation functions, (b) reverse-order initialization, (c) progressive quantization, and (d) network stacking, and show that these additions significantly improve existing state-of-the-art network binarization techniques. Additionally, for the first time, we investigate the extent to which network binarization and knowledge distillation can be combined. When tested on the challenging MPII dataset, our method shows a performance improvement of more than 4% in absolute terms. Finally, we further validate our findings by applying the proposed techniques to large-scale object recognition on the ImageNet dataset, on which we report a 4% reduction in error rate.
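As background, the following minimal PyTorch sketch shows the standard binarization building block: a sign function in the forward pass with a clipped straight-through estimator for the gradient. It does not reproduce the improved activation functions, initialization, progressive quantization, or stacking proposed above:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)             # hard-binarized features/weights

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where |x| <= 1, zero elsewhere.
        return grad_out * (x.abs() <= 1).float()

x = torch.randn(8, 16, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()                       # gradients flow despite the hard sign
```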
This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D meshes, 2D keypoints, or camera poses for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit the temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a common base shape, and common parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects, including animals captured in the wild, an extremely challenging task rarely addressed before.
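The sketch below illustrates, under assumed tensor shapes and placeholder weights, how temporal-consistency regularizers of this kind can be written: per-frame textures are pulled toward a shared texture map and per-frame shapes toward a common base shape. It is not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

# Toy per-frame predictions for a clip of T frames (V vertices, 64x64 texture).
T, V = 8, 500
shapes = torch.randn(T, V, 3, requires_grad=True)        # predicted mesh vertices
textures = torch.rand(T, 3, 64, 64, requires_grad=True)  # predicted texture maps

# Shared-texture term: every frame's texture should match the clip-wide average.
tex_mean = textures.mean(dim=0, keepdim=True)
loss_tex = F.l1_loss(textures, tex_mean.expand_as(textures))

# Base-shape term: per-frame shapes stay close to a common base shape, leaving
# only small per-frame (articulation) deviations.
base = shapes.mean(dim=0, keepdim=True)
loss_shape = ((shapes - base) ** 2).mean()

loss = loss_tex + 0.1 * loss_shape       # weights are placeholders
loss.backward()
```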