Graph Neural Networks (GNNs) have achieved great success on a variety of tasks with graph-structural data, among which node classification is an essential one. Unsupervised Graph Domain Adaptation (UGDA) shows its practical value of reducing the labeling cost for node classification. It leverages knowledge from a labeled graph (i.e., source domain) to tackle the same task on another unlabeled graph (i.e., target domain). Most existing UGDA methods heavily rely on the labeled graph in the source domain. They utilize labels from the source domain as the supervision signal and are jointly trained on both the source graph and the target graph. However, in some real-world scenarios, the source graph is inaccessible because of either unavailability or privacy issues. Therefore, we propose a novel scenario named Source Free Unsupervised Graph Domain Adaptation (SFUGDA). In this scenario, the only information we can leverage from the source domain is the well-trained source model, without any exposure to the source graph and its labels. As a result, existing UGDA methods are not feasible anymore. To address the non-trivial adaptation challenges in this practical scenario, we propose a model-agnostic algorithm for domain adaptation to fully exploit the discriminative ability of the source model while preserving the consistency of structural proximity on the target graph. We prove the effectiveness of the proposed algorithm both theoretically and empirically. The experimental results on four cross-domain tasks show consistent improvements of the Macro-F1 score up to 0.17.
Capturing the dynamically deforming 3D shape of clothed human is essential for numerous applications, including VR/AR, autonomous driving, and human-computer interaction. Existing methods either require a highly specialized capturing setup, such as expensive multi-view imaging systems, or they lack robustness to challenging body poses. In this work, we propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses, without any additional input. We first build a 3D template human model of the subject based on a learned regression model. We then track this template model's deformation under challenging body articulations based on 2D image observations. Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW. Moreover, we demonstrate its efficacy in robustness and generalizability on videos from iPER datasets.
Regularization can mitigate the generalization gap between training and inference by introducing inductive bias. Existing works have already proposed various inductive biases from diverse perspectives. However, to the best of our knowledge, none of them explores inductive bias from the perspective of class-dependent response distribution of individual neurons. In this paper, we conduct a substantial analysis of the characteristics of such distribution. Based on the analysis results, we articulate the Neuron Steadiness Hypothesis: the neuron with similar responses to instances of the same class leads to better generalization. Accordingly, we propose a new regularization method called Neuron Steadiness Regularization to reduce neuron intra-class response variance. We conduct extensive experiments on Multilayer Perceptron, Convolutional Neural Network, and Graph Neural Network with popular benchmark datasets of diverse domains, which show that our Neuron Steadiness Regularization consistently outperforms the vanilla version of models with significant gain and low additional overhead.
This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
Incorporating social relations into the recommendation system, i.e. social recommendation, has been widely studied in academic and industrial communities. While many promising results have been achieved, existing methods mostly assume that the social relations can be homogeneously applied to all the items, which is not practical for users' actually diverse preferences. In this paper, we argue that the effect of the social relations should be inhomogeneous, that is, two socially-related users may only share the same preference on some specific items, while for the other products, their preferences can be inconsistent or even contradictory. Inspired by this idea, we build a novel social recommendation model, where the traditional pair-wise "user-user'' relation is extended to the triple relation of "user-item-user''. To well handle such high-order relations, we base our framework on the hypergraph. More specifically, each hyperedge connects a user-user-item triplet, representing that the two users share similar preferences on the item. We develop a Social HyperGraph Convolutional Network (short for SHGCN) to learn from the complex triplet social relations. With the hypergraph convolutional networks, the social relations can be modeled in a more fine-grained manner, which more accurately depicts real users' preferences, and benefits the recommendation performance. Extensive experiments on two real-world datasets demonstrate our model's effectiveness. Studies on data sparsity and hyper-parameter studies further validate our model's rationality. Our codes and dataset are available at https://github.com/ziruizhu/SHGCN.
Upsampling videos of human activity is an interesting yet challenging task with many potential applications ranging from gaming to entertainment and sports broadcasting. The main difficulty in synthesizing video frames in this setting stems from the highly complex and non-linear nature of human motion and the complex appearance and texture of the body. We propose to address these issues in a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance. A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset (AMASS). The high-frame-rate pose predictions are then used by a neural rendering pipeline to produce the full-frame output, taking the pose and background consistency into consideration. Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training. Furthermore, we contribute the first evaluation dataset that consists of high-quality and high-frame-rate videos of human activities for this task. Compared with state-of-the-art video interpolation techniques, our method produces in-between frames with better quality and accuracy, which is evident by state-of-the-art results on pixel-level, distributional metrics and comparative user evaluations. Our code and the collected dataset are available at https://git.io/Render-In-Between.
We propose a multi-stage coarse-to-fine CNN-based framework, called SkullEngine, for high-resolution segmentation and large-scale landmark detection through a collaborative, integrated, and scalable JSD model and three segmentation and landmark detection refinement models. We evaluated our framework on a clinical dataset consisting of 170 CBCT/CT images for the task of segmenting 2 bones (midface and mandible) and detecting 175 clinically common landmarks on bones, teeth, and soft tissues.
In this paper we contribute a simple yet effective approach for estimating 3D poses of multiple people from multi-view images. Our proposed coarse-to-fine pipeline first aggregates noisy 2D observations from multiple camera views into 3D space and then associates them into individual instances based on a confidence-aware majority voting technique. The final pose estimates are attained from a novel optimization scheme which links high-confidence multi-view 2D observations and 3D joint candidates. Moreover, a statistical parametric body model such as SMPL is leveraged as a regularizing prior for these 3D joint candidates. Specifically, both 3D poses and SMPL parameters are optimized jointly in an alternating fashion. Here the parametric models help in correcting implausible 3D pose estimates and filling in missing joint detections while updated 3D poses in turn guide obtaining better SMPL estimations. By linking 2D and 3D observations, our method is both accurate and generalizes to different data sources because it better decouples the final 3D pose from the inter-person constellation and is more robust to noisy 2D detections. We systematically evaluate our method on public datasets and achieve state-of-the-art performance. The code and video will be available on the project page: https://ait.ethz.ch/projects/2021/multi-human-pose/.