Dr. Ing. h.c. F. Porsche AG
Abstract:Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
Abstract:3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
Abstract:Normalizing flows have proven their efficacy for density estimation in Euclidean space, but their application to rotational representations, crucial in various domains such as robotics or human pose modeling, remains underexplored. Probabilistic models of the human pose can benefit from approaches that rigorously consider the rotational nature of human joints. For this purpose, we introduce HuProSO3, a normalizing flow model that operates on a high-dimensional product space of SO(3) manifolds, modeling the joint distribution for human joints with three degrees of freedom. HuProSO3's advantage over state-of-the-art approaches is demonstrated through its superior modeling accuracy in three different applications and its capability to evaluate the exact likelihood. This work not only addresses the technical challenge of learning densities on SO(3) manifolds, but it also has broader implications for domains where the probabilistic regression of correlated 3D rotations is of importance.
Abstract:Neural networks reach state-of-the-art performance in a variety of learning tasks. However, a lack of understanding the decision making process yields to an appearance as black box. We address this and propose ConstraintNet, a neural network with the capability to constrain the output space in each forward pass via an additional input. The prediction of ConstraintNet is proven within the specified domain. This enables ConstraintNet to exclude unintended or even hazardous outputs explicitly whereas the final prediction is still learned from data. We focus on constraints in form of convex polytopes and show the generalization to further classes of constraints. ConstraintNet can be constructed easily by modifying existing neural network architectures. We highlight that ConstraintNet is end-to-end trainable with no overhead in the forward and backward pass. For illustration purposes, we model ConstraintNet by modifying a CNN and construct constraints for facial landmark prediction tasks. Furthermore, we demonstrate the application to a follow object controller for vehicles as a safety-critical application. We submitted an approach and system for the generation of safety-critical outputs of an entity based on ConstraintNet at the German Patent and Trademark Office with the official registration mark DE10 2019 119 739.