Kostas Daniilidis

Joint Estimation of Image Representations and their Lie Invariants

Dec 08, 2020
Christine Allen-Blanchette, Kostas Daniilidis

Images encode both the state of the world and its content. The former is useful for tasks such as planning and control, and the latter for classification. Automatically extracting this information is challenging because of the high dimensionality and entangled encoding inherent to the image representation. This article introduces two theoretical approaches aimed at resolving these challenges. The approaches allow for the interpolation and extrapolation of images from an image sequence by jointly estimating the image representation and the generators of the sequence dynamics. In the first approach, the image representations are learned using probabilistic PCA (Tipping and Bishop, 1999). The linear-Gaussian conditional distributions allow for a closed-form analytical description of the latent distributions but assume that the underlying image manifold is a linear subspace. In the second approach, the image representations are learned using probabilistic nonlinear PCA, which relaxes the linear manifold assumption at the cost of requiring a variational approximation of the latent distributions. In both approaches, the underlying dynamics of the image sequence are modelled explicitly to disentangle them from the image representations. The dynamics themselves are modelled with Lie group structure, which enforces the desirable properties of smoothness and composability of inter-image transformations.

* Resolves typographical errors
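
The two ingredients named in this abstract can be illustrated separately. The sketch below fits the standard closed-form maximum-likelihood PPCA solution (Tipping and Bishop, 1999), evaluates the Gaussian latent posterior that makes the first approach analytical, and moves a latent code along a one-parameter Lie group exp(tG), where fractional t interpolates and t outside the observed range extrapolates. It is a minimal illustration under stated assumptions, not the paper's method: the generator G is taken as given rather than jointly estimated, it is assumed to act linearly on the latent code, and the function names are hypothetical.

```python
import numpy as np


def fit_ppca(X, q):
    """Closed-form maximum-likelihood PPCA fit (Tipping and Bishop, 1999).

    X: (n, d) data matrix, one flattened image per row; q < d latent dimensions.
    Returns the mean mu (d,), loadings W (d, q) and isotropic noise variance sigma2.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / len(X)                      # sample covariance (d, d)
    evals, evecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending
    sigma2 = evals[q:].mean()                   # average variance of discarded directions
    # W = U_q (Lambda_q - sigma2 I)^(1/2), choosing the rotation R = I
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return mu, W, sigma2


def latent_posterior(x, mu, W, sigma2):
    """Closed-form Gaussian posterior p(z | x) = N(M^-1 W^T (x - mu), sigma2 M^-1)."""
    q = W.shape[1]
    M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
    return M_inv @ W.T @ (x - mu), sigma2 * M_inv


def expm(A, terms=30):
    """Matrix exponential via scaling and squaring of a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A_scaled = A / (2 ** s)                     # now ||A_scaled|| <= 0.5
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A_scaled / k
        result = result + term
    for _ in range(s):                          # undo the scaling: exp(A) = exp(A / 2^s)^(2^s)
        result = result @ result
    return result


def propagate(z0, G, t):
    """Assumption: latent dynamics z(t) = exp(t G) z0 for a given (hypothetical) generator G."""
    return expm(t * G) @ z0
```

Composability follows from the group structure: expm(a * G) @ expm(b * G) equals expm((a + b) * G), so chaining two inter-image transformations is the same as applying their combined parameter once.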

Learning Portrait Style Representations

Dec 08, 2020
Sadat Shaik, Bernadette Bucher, Nephele Agrafiotis, Stephen Phillips, Kostas Daniilidis, William Schmenner

Style analysis of artwork in computer vision predominantly focuses on target image generation, optimizing for an understanding of low-level style characteristics such as brush strokes. However, fundamentally different techniques are required to computationally understand and control qualities of art that involve higher-level style characteristics. We study style representations learned by neural network architectures that incorporate these higher-level characteristics. We find that learned style features vary when triplets annotated by art historians are used as supervision for style similarity. Networks that leverage statistical priors or are pretrained on photo collections such as ImageNet can also derive useful visual representations of artwork. We relate the impact of these expert-knowledge, statistical, and photorealism priors on style representations to art historical research, and use these representations to perform zero-shot classification of artists. To facilitate this work, we also present the first large-scale dataset of portraits prepared for computational analysis.

* Sadat Shaik and Bernadette Bucher contributed equally
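
The abstract does not state how the art-historian triplets enter training. A triplet margin loss is one common way to turn "A is closer in style to B than to C" judgments into supervision, so the sketch below assumes it; the squared Euclidean distance, the margin value and the function names are hypothetical choices, not taken from the paper.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings.

    Each argument is an (n, k) array; `positive` is the item annotated as closer
    in style to the anchor. A triplet contributes zero once the anchor is closer
    to the positive than to the negative by at least `margin` (hypothetical value).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to the "similar" item
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to the "dissimilar" item
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))


def triplet_accuracy(anchor, positive, negative):
    """Fraction of annotated triplets whose ordering the embedding reproduces."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(d_pos < d_neg)
```

The accuracy helper gives a simple way to compare representations learned with different priors against the same held-out expert triplets.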

Reinforcement Learning with Videos: Combining Offline Observations with Interaction

Nov 12, 2020
Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn

Reinforcement learning is a powerful framework for robots to acquire skills from experience, but it often requires a substantial amount of online data collection. As a result, it is difficult to collect the sufficiently diverse experience that robots need to generalize broadly. Videos of humans, on the other hand, are a readily available source of broad and interesting experience. In this paper, we consider the question: can we perform reinforcement learning directly on experience collected by humans? This problem is particularly difficult, as such videos are not annotated with actions and exhibit substantial visual domain shift relative to the robot's embodiment. To address these challenges, we propose a framework for reinforcement learning with videos (RLV). RLV learns a policy and value function using experience collected by humans in combination with data collected by robots. In our experiments, we find that RLV is able to leverage such videos to learn challenging vision-based skills with fewer than half as many samples as RL methods that learn from scratch.
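
Because human videos carry no action labels, observation-only transitions must somehow be given actions before an off-policy learner can use them alongside robot data. The sketch below illustrates one way to do that, assuming an inverse dynamics model fit on robot transitions; a linear least-squares model stands in for a learned network, each human transition is assumed to carry a reward label, and the visual domain shift mentioned in the abstract is not addressed. All names are hypothetical.

```python
import numpy as np


def _features(states, next_states):
    # [s, s', 1]: a bias column lets the linear model fit an offset
    return np.hstack([states, next_states, np.ones((len(states), 1))])


def fit_inverse_model(states, next_states, actions):
    """Hypothetical linear inverse model a ~= [s, s', 1] @ B, fit on robot data only."""
    B, *_ = np.linalg.lstsq(_features(states, next_states), actions, rcond=None)
    return B


def label_actions(B, states, next_states):
    """Infer the missing actions for observation-only (human video) transitions."""
    return _features(states, next_states) @ B


def mixed_batch(robot, human, B):
    """Concatenate robot transitions with action-labelled human transitions.

    `robot` has keys "s", "a", "s_next", "r"; `human` has the same keys except "a".
    """
    human_a = label_actions(B, human["s"], human["s_next"])
    return {
        "s": np.vstack([robot["s"], human["s"]]),
        "a": np.vstack([robot["a"], human_a]),
        "s_next": np.vstack([robot["s_next"], human["s_next"]]),
        "r": np.concatenate([robot["r"], human["r"]]),
    }
```

The resulting batch has the same format as ordinary robot experience, so a standard policy and value function update could consume it unchanged.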

3D Bird Reconstruction: a Dataset, Model, and Shape Recovery from a Single View

Aug 13, 2020
Marc Badger, Yufu Wang, Adarsh Modh, Ammon Perkes, Nikos Kolotouros, Bernd G. Pfrommer, Marc F. Schmidt, Kostas Daniilidis

Automated capture of animal pose is transforming how we study neuroscience and social behavior. Movements carry important social cues, but current methods cannot robustly estimate the pose and shape of animals, particularly for social animals such as birds, which are often occluded by each other and by objects in the environment. To address this problem, we first introduce a model and multi-view optimization approach, which we use to capture the unique shape and pose space displayed by live birds. We then introduce a pipeline and experiments for keypoint, mask, pose, and shape regression that recovers accurate avian postures from single views. Finally, we provide extensive multi-view keypoint and mask annotations collected from a group of 15 social birds housed together in an outdoor aviary. The project website with videos, results, code, mesh model, and the Penn Aviary Dataset can be found at https://marcbadger.github.io/avian-mesh.

* In ECCV 2020
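
A multi-view fitting objective of the kind described here typically includes a keypoint reprojection term that ignores occluded annotations. The sketch below shows such a term under simplifying assumptions: pinhole cameras given as 3x4 projection matrices, squared pixel error, and the model's 3D keypoints passed in directly rather than produced by the bird mesh model and its pose parameters. Function names are hypothetical.

```python
import numpy as np


def project(P, X):
    """Project 3D points X (n, 3) with a 3x4 camera matrix P to pixel coordinates (n, 2)."""
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]                # perspective division


def multiview_keypoint_loss(cameras, X, keypoints, visible):
    """Sum of squared reprojection errors over all views, skipping occluded keypoints.

    cameras: list of 3x4 projection matrices, one per view.
    X: (n, 3) 3D keypoints of the model in its current pose.
    keypoints: (v, n, 2) 2D annotations; visible: (v, n) boolean mask.
    """
    total = 0.0
    for P, kp, vis in zip(cameras, keypoints, visible):
        err = project(P, X) - kp
        total += np.sum(vis[:, None] * err ** 2)  # occluded points contribute nothing
    return total
```

In a full fit this term would be minimized over the model's pose and shape parameters, usually together with mask and prior terms that this sketch omits.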

TLIO: Tight Learned Inertial Odometry

Jul 10, 2020
Wenxin Liu, David Caruso, Eddy Ilg, Jing Dong, Anastasios I. Mourikis, Kostas Daniilidis, Vijay Kumar, Jakob Engel

In this work, we propose a tightly coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on an IMU kinematic motion model. However, integrating these measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed that trained neural networks can obtain accurate 2D displacement estimates from segments of IMU data, and that concatenating these displacements yields good position estimates. This paper demonstrates a network that regresses 3D displacement estimates and their uncertainty, which allows us to tightly fuse the relative state measurement into a stochastic cloning EKF that solves for pose, velocity, and sensor biases. We show that our network, trained with pedestrian data from a headset, can produce statistically consistent measurements and uncertainties for use in the filter's update step, and that the tightly coupled system outperforms velocity integration approaches in position estimation and an AHRS attitude filter in orientation estimation.

* Correcting graph and bibliography. Adding journal reference information and DOI, in IEEE Robotics and Automation Letters
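
The core idea of fusing a regressed displacement into a stochastic cloning filter is that the state keeps a copy (clone) of a past position, so the displacement becomes a measurement of the difference between the current position and the clone, weighted by the network's predicted covariance. The sketch below is a deliberately reduced, linear version of that update: the state holds only the two 3D positions, whereas the filter described in the abstract also tracks orientation, velocity and sensor biases and would need a linearized measurement Jacobian. The function name is hypothetical.

```python
import numpy as np


def displacement_update(x, P, d_meas, d_cov):
    """One Kalman update with a network-predicted 3D displacement and its covariance.

    Reduced state x = [p_clone (3), p_now (3)]: a cloned past position and the current one.
    The measurement model d = p_now - p_clone is linear, so H is constant here.
    d_cov is the (3, 3) covariance regressed alongside the displacement.
    """
    H = np.hstack([-np.eye(3), np.eye(3)])
    innovation = d_meas - H @ x
    S = H @ P @ H.T + d_cov                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x_new = x + K @ innovation
    I_KH = np.eye(len(x)) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ d_cov @ K.T  # Joseph form keeps P symmetric
    return x_new, P_new
```

A small predicted covariance pulls the state strongly toward the network's displacement; a large one leaves the propagated estimate mostly unchanged, which is why consistent uncertainty estimates matter for the update step.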

Simple and Effective VAE Training with Calibrated Decoders

Jun 23, 2020
Oleh Rybkin, Kostas Daniilidis, Sergey Levine

Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions. However, training VAEs often requires considerable hyperparameter tuning, and often relies on a heuristic weight on the prior KL-divergence term. In this work, we study how the performance of VAEs can be improved without this heuristic hyperparameter by learning calibrated decoders that accurately model the decoding distribution. While it may seem obvious that calibrated decoders should perform better than uncalibrated ones, much of the recent literature that employs VAEs uses uncalibrated Gaussian decoders with constant variance. We observe empirically that the naïve way of learning variance in Gaussian decoders does not lead to good results. However, other calibrated decoders, such as discrete decoders or decoders that learn a shared variance, can substantially improve performance. To further improve results, we propose a simple but novel modification to the commonly used Gaussian decoder, which represents the prediction variance non-parametrically. We observe empirically that the heuristic weight hyperparameter is not necessary with our method. We analyze the performance of various discrete and continuous decoders on a range of datasets and several single-image and sequential VAE models. Project website: https://orybkin.github.io/sigma-vae/

* Project website: https://orybkin.github.io/sigma-vae/
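
For a Gaussian decoder with one variance shared across all pixels, the variance that minimizes the negative log-likelihood for fixed reconstructions has a closed form: setting the derivative of the summed NLL with respect to sigma to zero gives sigma^2 = mean squared error. The sketch below assumes that "represents the prediction variance non-parametrically" can be read as computing this shared sigma from the current reconstructions instead of predicting it; that reading, the small floor on sigma and the function names are our assumptions. The KL term is added with weight 1, reflecting the abstract's point that no heuristic weight is needed.

```python
import numpy as np


def gaussian_nll(x, mean, sigma):
    """Negative log-likelihood of x under N(mean, sigma^2), summed over all entries."""
    return np.sum(0.5 * ((x - mean) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))


def optimal_sigma(x, mean):
    """Shared sigma minimizing gaussian_nll for fixed reconstructions: sqrt of the MSE."""
    return np.sqrt(np.mean((x - mean) ** 2))


def sigma_vae_loss(x, x_hat, z_mean, z_logvar):
    """Reconstruction NLL with the analytic shared sigma plus an unweighted KL term.

    The KL is for a diagonal Gaussian posterior against a standard normal prior.
    """
    sigma = max(optimal_sigma(x, x_hat), 1e-6)  # floor avoids log(0); our addition
    rec = gaussian_nll(x, x_hat, sigma)
    kl = 0.5 * np.sum(z_mean ** 2 + np.exp(z_logvar) - 1.0 - z_logvar)
    return rec + kl
```

Because sigma tracks the reconstruction error, the relative weight of the reconstruction and KL terms adapts during training without a hand-tuned coefficient.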

Spin-Weighted Spherical CNNs

Jun 18, 2020
Carlos Esteves, Ameesh Makadia, Kostas Daniilidis

Learning equivariant representations is a promising way to reduce sample and model complexity and improve the generalization performance of deep neural networks. Spherical CNNs are successful examples, producing SO(3)-equivariant representations of spherical inputs. There are two main types of spherical CNNs. The first type lifts the inputs to functions on the rotation group SO(3) and applies convolutions on the group, which is computationally expensive since SO(3) has one more dimension than the sphere. The second type applies convolutions directly on the sphere, which restricts it to zonal (isotropic) filters with limited expressivity. In this paper, we present a new type of spherical CNN that efficiently supports anisotropic filters without ever leaving the spherical domain. The key idea is to consider spin-weighted spherical functions, which were introduced in physics in the study of gravitational waves. These are complex-valued functions on the sphere whose phases change upon rotation. We define a convolution between spin-weighted functions and build a CNN based on it. Experiments show that our method outperforms isotropic spherical CNNs while still being much more efficient than using SO(3) convolutions. The spin-weighted functions can also be interpreted as spherical vector fields, allowing applications to tasks where the inputs or outputs are vector fields.
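
Two claims in this abstract can be checked numerically at a single point on the sphere: spin-weighted values change phase when the local tangent frame rotates, and a spin-1 value can be read as a tangent vector. The sketch assumes the encoding f = v_theta + i * v_phi and the phase factor exp(-i * s * gamma) for spin weight s; sign conventions differ between references, and the sketch does not implement the paper's spin-weighted convolution. Function names are hypothetical.

```python
import numpy as np


def to_spin1(v_theta, v_phi):
    """Encode tangent-vector components in a local frame (e_theta, e_phi) as one complex value."""
    return v_theta + 1j * v_phi


def rotate_frame(v_theta, v_phi, gamma):
    """Components of the same tangent vector after rotating the local frame by angle gamma."""
    c, s = np.cos(gamma), np.sin(gamma)
    return c * v_theta + s * v_phi, -s * v_theta + c * v_phi


def spin_transform(f, spin, gamma):
    """Phase change of a spin-weight `spin` value under a frame rotation by gamma (assumed convention)."""
    return np.exp(-1j * spin * gamma) * f
```

Rotating the frame and re-encoding the vector gives the same complex number as multiplying the spin-1 value by exp(-i * gamma). Spin-0 values (ordinary scalar functions) are unaffected, which is why filters built only from them end up isotropic.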

Coherent Reconstruction of Multiple Humans from a Single Image

Jun 15, 2020
Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, Kostas Daniilidis

In this work, we address the problem of multi-person 3D pose estimation from a single image. A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each of them independently. However, this type of prediction suffers from incoherent results, e.g., interpenetration and inconsistent depth ordering between the people in the scene. Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene. To this end, a key design choice is the incorporation of the SMPL parametric body model in our top-down framework, which enables the use of two novel losses. First, a distance-field-based collision loss penalizes interpenetration among the reconstructed people. Second, a depth-ordering-aware loss reasons about occlusions and promotes a depth ordering of people that leads to a rendering consistent with the annotated instance segmentation. This provides depth supervision signals to the network, even if the image has no explicit 3D annotations. Experiments show that our approach outperforms previous methods on standard 3D pose benchmarks, while our proposed losses enable more coherent reconstruction in natural images. The project website with videos, results, and code can be found at: https://jiangwenpl.github.io/multiperson

* CVPR 2020. Project Page: https://jiangwenpl.github.io/multiperson/
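
The collision loss described above penalizes vertices of one person that fall inside another person's body, using a distance field that is positive inside the body and zero outside. The sketch below illustrates that structure under a simplifying assumption: each person's volume is approximated by a few spheres (a hypothetical stand-in for the distance field of the SMPL mesh), so the penetration depth of a point is available in closed form. Function names are hypothetical.

```python
import numpy as np


def sphere_penetration(points, centers, radii):
    """Stand-in distance field: how deep each point lies inside a union of spheres.

    points: (n, 3); centers: (m, 3); radii: (m,). Returns (n,) depths, zero outside every sphere.
    """
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (n, m) distances
    return np.max(np.maximum(radii[None, :] - d, 0.0), axis=1)


def collision_loss(vertices, volumes):
    """Sum over ordered pairs i != j of person j's vertex penetration into person i's volume.

    vertices: list of (n_j, 3) arrays, one per person.
    volumes: list of (centers, radii) sphere approximations, one per person.
    """
    total = 0.0
    for i, (centers, radii) in enumerate(volumes):
        for j, verts in enumerate(vertices):
            if i != j:  # a person never collides with itself
                total += np.sum(sphere_penetration(verts, centers, radii))
    return total
```

The loss is zero whenever no person's vertices enter another person's volume and grows with penetration depth, so minimizing it pushes reconstructions apart only where they actually overlap.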
