Although 2D generative models have made great progress in face image generation and animation, they often suffer from undesirable artifacts such as 3D inconsistency when rendering images from different camera viewpoints. This prevents them from synthesizing video animations indistinguishable from real ones. Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by leveraging 3D scene representations. These methods can well preserve the 3D consistency of the generated images across different views, yet they cannot achieve fine-grained control over other attributes, among which facial expression control is arguably the most useful and desirable for face animation. In this paper, we propose an animatable 3D-aware GAN for multiview consistent face animation generation. The key idea is to decompose the 3D representation of the 3D-aware GAN into a template field and a deformation field, where the former represents different identities with a canonical expression, and the latter characterizes expression variations of each identity. To achieve meaningful control over facial expressions via deformation, we propose a 3D-level imitative learning scheme between the generator and a parametric 3D face model during adversarial training of the 3D-aware GAN. This helps our method achieve high-quality animatable face image generation with strong visual 3D consistency, even though trained with only unstructured 2D images. Extensive experiments demonstrate our superior performance over prior works. Project page: https://yuewuhkust.github.io/AniFaceGAN
The empirical studies of Graph Neural Networks (GNNs) broadly take the original node feature and adjacency relationship as singleview input, ignoring the rich information of multiple graph views. To circumvent this issue, the multiview graph analysis framework has been developed to fuse graph information across views. How to model and integrate shared (i.e. consistency) and view-specific (i.e. complementarity) information is a key issue in multiview graph analysis. In this paper, we propose a novel Multiview Variational Graph Information Bottleneck (MVGIB) principle to maximize the agreement for common representations and the disagreement for view-specific representations. Under this principle, we formulate the common and view-specific information bottleneck objectives across multiviews by using constraints from mutual information. However, these objectives are hard to directly optimize since the mutual information is computationally intractable. To tackle this challenge, we derive variational lower and upper bounds of mutual information terms, and then instead optimize variational bounds to find the approximate solutions for the information objectives. Extensive experiments on graph benchmark datasets demonstrate the superior effectiveness of the proposed method.
Can agents be trained to answer difficult mathematical questions by playing a game? We consider the integer feasibility problem, a challenge of deciding whether a system of linear equations and inequalities has a solution with integer values. This is a famous NP-complete problem with applications in many areas of Mathematics and Computer Science. Our paper describes a novel algebraic reinforcement learning framework that allows an agent to play a game equivalent to the integer feasibility problem. We explain how to transform the integer feasibility problem into a game over a set of arrays with fixed margin sums. The game starts with an initial state (an array), and by applying a legal move that leaves the margins unchanged, we aim to eventually reach a winning state with zeros in specific positions. To win the game the player must find a path between the initial state and a final terminal winning state if one exists. Finding such a winning state is equivalent to solving the integer feasibility problem. The key algebraic ingredient is a Gr\"obner basis of the toric ideal for the underlying axial transportation polyhedron. The Gr\"obner basis can be seen as a set of connecting moves (actions) of the game. We then propose a novel RL approach that trains an agent to predict moves in continuous space to cope with the large size of action space. The continuous move is then projected onto the set of legal moves so that the path always leads to valid states. As a proof of concept we demonstrate in experiments that our agent can play well the simplest version of our game for 2-way tables. Our work highlights the potential to train agents to solve non-trivial mathematical queries through contemporary machine learning methods used to train agents to play games.
The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has achieved great success in deep learning. However, the understanding of such architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model will not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structures, which is hard to learn using a single expert. Yet with the MoE layer, by choosing the experts as two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the cluster-center features, which helps divide the input complex problem into simpler linear classification sub-problems that individual experts can conquer. To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.
A surge of interest in Graph Convolutional Networks (GCN) has produced thousands of GCN variants, with hundreds introduced every year. In contrast, many GCN models re-use only a handful of benchmark datasets as many graphs of interest, such as social or commercial networks, are proprietary. We propose a new graph generation problem to enable generating a diverse set of benchmark graphs for GCNs following the distribution of a source graph -- possibly proprietary -- with three requirements: 1) benchmark effectiveness as a substitute for the source graph for GCN research, 2) scalability to process large-scale real-world graphs, and 3) a privacy guarantee for end-users. With a novel graph encoding scheme, we reframe large-scale graph generation problem into medium-length sequence generation problem and apply the strong generation power of the Transformer architecture to the graph domain. Extensive experiments across a vast body of graph generative models show that our model can successfully generate benchmark graphs with the realistic graph structure, node attributes, and node labels required to benchmark GCNs on node classification tasks.
Video prediction is an extrapolation task that predicts future frames given past frames, and video frame interpolation is an interpolation task that estimates intermediate frames between two frames. We have witnessed the tremendous advancement of video frame interpolation, but the general video prediction in the wild is still an open question. Inspired by the photo-realistic results of video frame interpolation, we present a new optimization framework for video prediction via video frame interpolation, in which we solve an extrapolation problem based on an interpolation model. Our video prediction framework is based on optimization with a pretrained differentiable video frame interpolation module without the need for a training dataset, and thus there is no domain gap issue between training and test data. Also, our approach does not need any additional information such as semantic or instance maps, which makes our framework applicable to any video. Extensive experiments on the Cityscapes, KITTI, DAVIS, Middlebury, and Vimeo90K datasets show that our video prediction results are robust in general scenarios, and our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, \emph{Geometric Policy Iteration} (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\bigO{\frac{|\actions|}{1 - \gamma}\log \frac{1}{1-\gamma}}$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by "prompt", e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM provided similarity measure between the image and the prompt sentence "a photo of a [CLASS]". Therefore, prompt shows a great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure that improper fine-tuning may not only undermine the prompt's inherent prediction for the task-related classes, but also for other classes in the VLM vocabulary. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompt. We present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned (or non-conflicting) to the "general direction", which is represented as the gradient of the KL loss of the pre-defined prompt prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Codes are available at https://github.com/BeierZhu/Prompt-align.
Registration of multi-view point clouds is fundamental in 3D reconstruction. Since there are close connections between point clouds captured from different viewpoints, registration performance can be enhanced if these connections be harnessed properly. Therefore, this paper models the registration problem as multi-task optimization, and proposes a novel bi-channel knowledge sharing mechanism for effective and efficient problem solving. The modeling of multi-view point cloud registration as multi-task optimization are twofold. By simultaneously considering the local accuracy of two point clouds as well as the global consistency posed by all the point clouds involved, a fitness function with an adaptive threshold is derived. Also a framework of the co-evolutionary search process is defined for the concurrent optimization of multiple fitness functions belonging to related tasks. To enhance solution quality and convergence speed, the proposed bi-channel knowledge sharing mechanism plays its role. The intra-task knowledge sharing introduces aiding tasks that are much simpler to solve, and useful information is shared within tasks, accelerating the search process. The inter-task knowledge sharing explores commonalities buried among tasks, aiming to prevent tasks from getting stuck to local optima. Comprehensive experiments conducted on model object as well as scene point clouds show the efficacy of the proposed method.
Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer the lack of manual annotation for deep supervised learning, this paper presents a first study on utilizing semi-supervised object detection (SSOD) for lesion localization in OCT images. To that end, we develop a taxonomy to provide a unified and structured viewpoint of the current SSOD methods, and consequently identify key modules in these methods. To evaluate the influence of these modules in the new task, we build OCT-SS, a new dataset consisting of over 1k expert-labeled OCT B-scan images and over 13k unlabeled B-scans. Extensive experiments on OCT-SS identify Unbiased Teacher (UnT) as the best current SSOD method for lesion localization. Moreover, we improve over this strong baseline, with mAP increased from 49.34 to 50.86.