We propose a family of novel hierarchical Bayesian deep auto-encoder models capable of identifying disentangled factors of variability in data. While many recent attempts at factor disentanglement have focused on sophisticated learning objectives within the VAE framework, their choice of a standard normal as the latent factor prior is both suboptimal and detrimental to performance. Our key observation is that the disentangled latent variables responsible for major sources of variability, the relevant factors, can be more appropriately modeled using long-tail distributions. The typical Gaussian priors are, on the other hand, better suited for modeling of nuisance factors. Motivated by this, we extend the VAE to a hierarchical Bayesian model by introducing hyper-priors on the variances of Gaussian latent priors, mimicking an infinite mixture, while maintaining tractable learning and inference of the traditional VAEs. This analysis signifies the importance of partitioning and treating in a different manner the latent dimensions corresponding to relevant factors and nuisances. Our proposed models, dubbed Bayes-Factor-VAEs, are shown to outperform existing methods both quantitatively and qualitatively in terms of latent disentanglement across several challenging benchmark tasks.
Deep Gaussian processes (DGP) have appealing Bayesian properties, can handle variable-sized data, and learn deep features. Their limitation is that they do not scale well with the size of the data. Existing approaches address this using a deep random feature (DRF) expansion model, which makes inference tractable by approximating DGPs. However, DRF is not suitable for variable-sized input data such as trees, graphs, and sequences. We introduce the GP-DRF, a novel Bayesian model with an input layer of GPs, followed by DRF layers. The key advantage is that the combination of GP and DRF leads to a tractable model that can both handle a variable-sized input as well as learn deep long-range dependency structures of the data. We provide a novel efficient method to simultaneously infer the posterior of GP's latent vectors and infer the posterior of DRF's internal weights and random frequencies. Our experiments show that GP-DRF outperforms the standard GP model and DRF model across many datasets. Furthermore, they demonstrate that GP-DRF enables improved uncertainty quantification compared to GP and DRF alone, with respect to a Bhattacharyya distance assessment. Source code is available at https://github.com/IssamLaradji/GP_DRF.
In this work we propose a new computational framework, based on generative deep models, for synthesis of photo-realistic food meal images from textual descriptions of its ingredients. Previous works on synthesis of images from text typically rely on pre-trained text models to extract text features, followed by a generative neural networks (GANs) aimed to generate realistic images conditioned on the text features. These works mainly focus on generating spatially compact and well-defined categories of objects, such as birds or flowers. In contrast, meal images are significantly more complex, consisting of multiple ingredients whose appearance and spatial qualities are further modified by cooking methods. We propose a method that first builds an attention-based ingredients-image association model, which is then used to condition a generative neural network tasked with synthesizing meal images. Furthermore, a cycle-consistent constraint is added to further improve image quality and control appearance. Extensive experiments show our model is able to generate meal image corresponding to the ingredients, which could be used to augment existing dataset for solving other computational food analysis problems.
In this paper, we propose a generative framework that unifies depth-based 3D facial pose tracking and face model adaptation on-the-fly, in the unconstrained scenarios with heavy occlusions and arbitrary facial expression variations. Specifically, we introduce a statistical 3D morphable model that flexibly describes the distribution of points on the surface of the face model, with an efficient switchable online adaptation that gradually captures the identity of the tracked subject and rapidly constructs a suitable face model when the subject changes. Moreover, unlike prior art that employed ICP-based facial pose estimation, to improve robustness to occlusions, we propose a ray visibility constraint that regularizes the pose based on the face model's visibility with respect to the input point cloud. Ablation studies and experimental results on Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective and outperforms completing state-of-the-art depth-based methods.
In unsupervised domain adaptation, it is widely known that the target domain error can be provably reduced by having a shared input representation that makes the source and target domains indistinguishable from each other. Very recently it has been studied that not just matching the marginal input distributions, but the alignment of output (class) distributions is also critical. The latter can be achieved by minimizing the maximum discrepancy of predictors (classifiers). In this paper, we adopt this principle, but propose a more systematic and effective way to achieve hypothesis consistency via Gaussian processes (GP). The GP allows us to define/induce a hypothesis space of the classifiers from the posterior distribution of the latent random functions, turning the learning into a simple large-margin posterior separation problem, far easier to solve than previous approaches based on adversarial minimax optimization. We formulate a learning objective that effectively pushes the posterior to minimize the maximum discrepancy. This is further shown to be equivalent to maximizing margins and minimizing uncertainty of the class predictions in the target domain, a well-established principle in classical (semi-)supervised learning. Empirical results demonstrate that our approach is comparable or superior to the existing methods on several benchmark domain adaptation datasets.
We propose a novel VAE-based deep auto-encoder model that can learn disentangled latent representations in a fully unsupervised manner, endowed with the ability to identify all meaningful sources of variation and their cardinality. Our model, dubbed Relevance-Factor-VAE, leverages the total correlation (TC) in the latent space to achieve the disentanglement goal, but also addresses the key issue of existing approaches which cannot distinguish between meaningful and nuisance factors of latent variation, often the source of considerable degradation in disentanglement performance. We tackle this issue by introducing the so-called relevance indicator variables that can be automatically learned from data, together with the VAE parameters. Our model effectively focuses the TC loss onto the relevant factors only by tolerating large prior KL divergences, a desideratum justified by our semi-parametric theoretical analysis. Using a suite of disentanglement metrics, including a newly proposed one, as well as qualitative evidence, we demonstrate that our model outperforms existing methods across several challenging benchmark datasets.
Unsupervised domain adaptation (uDA) models focus on pairwise adaptation settings where there is a single, labeled, source and a single target domain. However, in many real-world settings one seeks to adapt to multiple, but somewhat similar, target domains. Applying pairwise adaptation approaches to this setting may be suboptimal, as they fail to leverage shared information among multiple domains. In this work we propose an information theoretic approach for domain adaptation in the novel context of multiple target domains with unlabeled instances and one source domain with labeled instances. Our model aims to find a shared latent space common to all domains, while simultaneously accounting for the remaining private, domain-specific factors. Disentanglement of shared and private information is accomplished using a unified information-theoretic approach, which also serves to establish a stronger link between the latent representations and the observed data. The resulting model, accompanied by an efficient optimization algorithm, allows simultaneous adaptation from a single source to multiple target domains. We test our approach on three challenging publicly-available datasets, showing that it outperforms several popular domain adaptation methods.
We address the problem of using hand-drawn sketches to edit the facial identity, such as enlarging the shape or modifying the position of eyes or mouth, in the entire video. This task is formulated as a 3D face model reconstruction and deformation problem. We first introduce a two-stage real-time 3D face model fitting schema to recover the facial identity and expressions from the video. User's editing intention is recognized from input sketches as a set of facial modifications. Then a novel identity deformation algorithm is proposed to transfer these facial deformations from 2D space to the 3D facial identity directly, while preserving the facial expressions. After an optional stage for further refining the 3D face model, these changes are propagated to the whole video with the modified identity. Both the user study and experimental results demonstrate that our sketching framework can help users effectively edit facial identities in videos, while high consistency and fidelity are ensured at the same time.
This paper presents Generative Adversarial Talking Head (GATH), a novel deep generative neural network that enables fully automatic facial expression synthesis of an arbitrary portrait with continuous action unit (AU) coefficients. Specifically, our model directly manipulates image pixels to make the unseen subject in the still photo express various emotions controlled by values of facial AU coefficients, while maintaining her personal characteristics, such as facial geometry, skin color and hair style, as well as the original surrounding background. In contrast to prior work, GATH is purely data-driven and it requires neither a statistical face model nor image processing tricks to enact facial deformations. Additionally, our model is trained from unpaired data, where the input image, with its auxiliary identity label taken from abundance of still photos in the wild, and the target frame are from different persons. In order to effectively learn such model, we propose a novel weakly supervised adversarial learning framework that consists of a generator, a discriminator, a classifier and an action unit estimator. Our work gives rise to template-and-target-free expression editing, where still faces can be effortlessly animated with arbitrary AU coefficients provided by the user.
We present a deep learning framework for real-time speech-driven 3D facial animation from just raw waveforms. Our deep neural network directly maps an input sequence of speech audio to a series of micro facial action unit activations and head rotations to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or in a surprised state, both eyebrows raise higher. Experiments on a diverse audiovisual corpus of different actors across a wide range of emotional states show interesting and promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.