Francesc Moreno-Noguer

Back to MLP: A Simple Baseline for Human Motion Prediction

Jul 04, 2022

Wen Guo, Yuming Du, Xi Shen, Vincent Lepetit, Xavier Alameda-Pineda, Francesc Moreno-Noguer

Abstract: This paper tackles the problem of human motion prediction, which consists in forecasting future body poses from historically observed sequences. Despite their performance, current state-of-the-art approaches rely on deep learning architectures of arbitrary complexity, such as Recurrent Neural Networks (RNN), Transformers or Graph Convolutional Networks (GCN), typically requiring multiple training stages and more than 3 million parameters. In this paper we show that the performance of these approaches can be surpassed by a lightweight, purely MLP architecture with only 0.14M parameters when appropriately combined with several standard practices such as representing the body pose with the Discrete Cosine Transform (DCT), predicting the residual displacement of joints and optimizing velocity as an auxiliary loss. An exhaustive evaluation on the Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches. We hope that our simple method can serve as a strong baseline for the community and allow re-thinking the problem of human motion prediction and whether current benchmarks really need intricate architectural designs. Our code is available at https://github.com/dulucas/siMLPe.

* Tech report. Code available at https://github.com/dulucas/siMLPe
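
The three standard practices named in the abstract (a temporal DCT, residual joint displacements and a velocity term in the loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the equal weighting of the two loss terms and the zero "network output" are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are basis vectors)."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)  # rescale the DC row so that d @ d.T == I
    return d

def velocity(seq):
    """Frame-to-frame differences along the time axis (axis 0)."""
    return seq[1:] - seq[:-1]

def position_and_velocity_loss(pred, target):
    """Mean L2 joint error plus the same error on velocities.
    Equal weighting of the two terms is an assumption of this sketch."""
    pos = np.linalg.norm(pred - target, axis=-1).mean()
    vel = np.linalg.norm(velocity(pred) - velocity(target), axis=-1).mean()
    return pos + vel

# Toy data: 10 observed frames, 4 joints, 3D coordinates (illustrative shapes).
rng = np.random.default_rng(0)
T, J = 10, 4
past = rng.normal(size=(T, J, 3))

D = dct_matrix(T)
coeffs = np.tensordot(D, past, axes=(1, 0))     # temporal DCT, shape (T, J, 3)
recon = np.tensordot(D.T, coeffs, axes=(1, 0))  # inverse DCT (D is orthonormal)

# Residual prediction: the model output is a displacement added to the last
# observed pose. A zero displacement stands in for a real network here.
displacement = np.zeros((T, J, 3))
future = past[-1] + displacement
```

Because the DCT matrix is orthonormal, its transpose inverts it, so `recon` matches `past` up to floating-point error. With a zero displacement, `future` simply repeats the last observed pose.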

Learned Vertex Descent: A New Direction for 3D Human Model Fitting

May 12, 2022

Enric Corona, Gerard Pons-Moll, Guillem Alenyà, Francesc Moreno-Noguer

Abstract: We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields network. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices at a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method.

* Project page: https://www.iri.upc.edu/people/ecorona/lvd/
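
The inference loop described in the abstract (repeatedly moving every vertex along a predicted descent direction until convergence, starting from a single point) can be sketched as follows. The `predict_descent` function is a hypothetical stand-in: in the paper the direction comes from a learned network that sees image features, whereas this toy version reads the target directly so that the loop can run on its own. The step factor, tolerance and vertex count are also assumptions.

```python
import numpy as np

def predict_descent(vertices, target):
    """Hypothetical stand-in for the learned network. It returns a step that
    moves each vertex halfway to its target; a real model would predict the
    step from image features and would not have access to the target."""
    return 0.5 * (target - vertices)

def fit(vertices, target, steps=50, tol=1e-6):
    """Apply predicted per-vertex steps until the largest step is below tol."""
    for _ in range(steps):
        step = predict_descent(vertices, target)
        vertices = vertices + step
        if np.abs(step).max() < tol:
            break
    return vertices

rng = np.random.default_rng(0)
target = rng.normal(size=(100, 3))   # toy "ground-truth" vertex positions
init = np.zeros_like(target)         # all vertices start at a single point
fitted = fit(init, target)
```

Each iteration halves the remaining error in this toy setting, so the loop stops after a few dozen steps.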

Single-view 3D Body and Cloth Reconstruction under Complex Poses

May 09, 2022

Nicolas Ugrinovic, Albert Pumarola, Alberto Sanfeliu, Francesc Moreno-Noguer

Abstract: Recent advances in 3D human shape reconstruction from single images have shown impressive results, leveraging deep networks that model the so-called implicit function to learn the occupancy status of arbitrarily dense 3D points in space. However, while current algorithms based on this paradigm, like PiFuHD, are able to estimate accurate geometry of the human shape and clothes, they require high-resolution input images and are not able to capture complex body poses. Most training and evaluation is performed on 1k-resolution images of humans standing in front of the camera under neutral body poses. In this paper, we leverage publicly available data to extend existing implicit function-based models to deal with images of humans that can have arbitrary poses and self-occluded limbs. We argue that the representation power of the implicit function is not sufficient to simultaneously model details of the geometry and of the body pose. We therefore propose a coarse-to-fine approach in which we first learn an implicit function that maps the input image to a 3D body shape with a low level of detail, but which correctly fits the underlying human pose, despite its complexity. We then learn a displacement map, conditioned on the smoothed surface and on the input image, which encodes the high-frequency details of the clothes and body. In the experimental section, we show that this coarse-to-fine strategy represents a very good trade-off between shape detail and pose correctness, comparing favorably to the most recent state-of-the-art approaches. Our code will be made publicly available.

Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation

Apr 11, 2022

Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo, Alberto Sanfeliu, Francesc Moreno-Noguer

Abstract: Recovering multi-person 3D poses from a single RGB image is a severely ill-conditioned problem due not only to the inherent 2D-3D depth ambiguity but also to inter-person occlusions and body truncations. Recent works have shown promising results by simultaneously reasoning for different people but in all cases within a local neighborhood. An interesting exception is PI-Net, which introduces a self-attention block to reason for all people in the image at the same time and refine potentially noisy initial 3D poses. However, the proposed methodology requires defining one of the individuals as a reference, and the outcome of the algorithm is sensitive to this choice. In this paper, we model people interactions as a whole, independently of their number, and in a permutation-invariant manner, building upon the Set Transformer. We leverage this representation to refine the initial 3D poses estimated by off-the-shelf detectors. A thorough evaluation demonstrates that our approach is able to boost the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on the MuPoTS-3D, CMU Panoptic and NBA2K datasets. Additionally, the proposed module is computationally efficient and can be used as a drop-in complement for any 3D pose detector in multi-people scenes.

LISA: Learning Implicit Shape and Appearance of Hands

Apr 04, 2022

Enric Corona, Tomas Hodan, Minh Vo, Francesc Moreno-Noguer, Chris Sweeney, Richard Newcombe, Lingni Ma

Abstract: This paper proposes a do-it-all neural model of human hands, named LISA. The model can capture accurate hand shape and appearance, generalize to arbitrary hand subjects, provide dense surface correspondences, be reconstructed from images in the wild and be easily animated. We train LISA by minimizing the shape and appearance losses on a large set of multi-view RGB image sequences annotated with coarse 3D poses of the hand skeleton. For a 3D point in the hand local coordinate, our model predicts the color and the signed distance with respect to each hand bone independently, and then combines the per-bone predictions using predicted skinning weights. The shape, color and pose representations are disentangled by design, making it possible to estimate or animate only selected parameters. We experimentally demonstrate that LISA can accurately reconstruct a dynamic hand from monocular or multi-view sequences, achieving a noticeably higher quality of reconstructed hand shapes compared to baseline approaches. Project page: https://www.iri.upc.edu/people/ecorona/lisa/.

* Published at CVPR 2022
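
The step in which per-bone predictions are combined using predicted skinning weights might look roughly like the sketch below. The abstract does not specify the combination rule, so the weighted sum, the softmax normalization of the weights, and every name and shape here are assumptions rather than the published model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combine_per_bone(sdf_per_bone, color_per_bone, weight_logits):
    """Blend per-bone signed distances (P, B) and colors (P, B, 3) with
    skinning weights obtained from logits (P, B). Weighted averaging is an
    assumption of this sketch."""
    w = softmax(weight_logits, axis=-1)                    # (P, B), rows sum to 1
    sdf = (w * sdf_per_bone).sum(axis=-1)                  # (P,)
    color = (w[..., None] * color_per_bone).sum(axis=-2)   # (P, 3)
    return sdf, color

rng = np.random.default_rng(0)
P, B = 5, 3                                   # 5 query points, 3 bones (toy sizes)
sdf_per_bone = rng.normal(size=(P, B))
color_per_bone = rng.uniform(size=(P, B, 3))
uniform_logits = np.zeros((P, B))             # equal weight for every bone
sdf, color = combine_per_bone(sdf_per_bone, color_per_bone, uniform_logits)
```

With uniform logits, the blended values reduce to a plain average over bones, which gives a quick sanity check of the shapes and the weighting.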

HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE

Apr 04, 2022

Xiaoyu Bie, Wen Guo, Simon Leglaive, Lauren Girin, Francesc Moreno-Noguer, Xavier Alameda-Pineda

Abstract: Studies on the automatic processing of 3D human pose data have flourished in the recent past. In this paper, we are interested in the generation of plausible and diverse future human poses following an observed 3D pose sequence. Current methods address this problem by injecting random variables from a single latent space into a deterministic motion prediction framework, which precludes the inherent multi-modality in human motion generation. In addition, to our knowledge, previous works rarely explore the use of attention to select which frames are to be used to inform the generation process. To overcome these limitations, we propose the Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms. HiT-DVAE simultaneously learns the evolution of data and latent space distribution with time-correlated probabilistic dependencies, thus enabling the generative model to learn a more complex and time-varying latent space as well as diverse and realistic human motions. Furthermore, the auto-regressive generation brings more flexibility in observation and prediction, i.e. one can have any length of observation and predict arbitrarily long sequences of poses with a single pre-trained model. We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.

Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification

Mar 18, 2022

Jianxiong Shen, Antonio Agudo, Francesc Moreno-Noguer, Adria Ruiz

Abstract: A critical limitation of current methods based on Neural Radiance Fields (NeRF) is that they are unable to quantify the uncertainty associated with the learned appearance and geometry of the scene. This information is paramount in real applications such as medical diagnosis or autonomous driving where, to reduce potentially catastrophic failures, the confidence in the model outputs must be included in the decision-making process. In this context, we introduce Conditional-Flow NeRF (CF-NeRF), a novel probabilistic framework to incorporate uncertainty quantification into NeRF-based approaches. For this purpose, our method learns a distribution over all possible radiance fields, which is used to quantify the uncertainty associated with the modelled scene. In contrast to previous approaches enforcing strong constraints over the radiance field distribution, CF-NeRF learns it in a flexible and fully data-driven manner by coupling Latent Variable Modelling and Conditional Normalizing Flows. This strategy makes it possible to obtain reliable uncertainty estimation while preserving model expressivity. Compared to previous state-of-the-art methods proposed for uncertainty quantification in NeRF, our experiments show that the proposed method achieves significantly lower prediction errors and more reliable uncertainty values for synthetic novel view and depth-map estimation.

Enhancing Egocentric 3D Pose Estimation with Third Person Views

Jan 07, 2022

Ameya Dhamanaskar, Mariella Dimiccoli, Enric Corona, Albert Pumarola, Francesc Moreno-Noguer

Abstract: In this paper, we propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera. The key idea is to leverage high-level features linking first- and third-views in a joint embedding space. To learn such an embedding space we introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos, without needing domain adaptation or knowledge of camera parameters. We achieve a significant improvement in egocentric 3D body pose estimation performance on two unconstrained datasets, over three supervised state-of-the-art approaches. Our dataset and code will be available for research purposes.

PhysXNet: A Customizable Approach for Learning Cloth Dynamics on Dressed People

Nov 13, 2021

Jordi Sanchez-Riera, Albert Pumarola, Francesc Moreno-Noguer

Abstract: We introduce PhysXNet, a learning-based approach to predict the dynamics of deformable clothes given 3D skeleton motion sequences of humans wearing these clothes. The proposed model is adaptable to a large variety of garments and changing topologies, without needing to be retrained. Such simulations are typically carried out by physics engines that require manual human expertise and are subject to computationally intensive computations. PhysXNet, by contrast, is a fully differentiable deep network that at inference is able to estimate the geometry of dense cloth meshes in a matter of milliseconds, and thus can be readily deployed as a layer of a larger deep learning architecture. This efficiency is achieved thanks to the specific parameterization of the clothes we consider, based on 3D UV maps encoding spatial garment displacements. The problem is then formulated as a mapping from the human kinematics space (also represented by 3D UV maps of the undressed body mesh) to the clothes displacement UV maps, which we learn using a conditional GAN with a discriminator that enforces feasible deformations. We simultaneously train our model for three garment templates (tops, bottoms and dresses), for which we simulate deformations under 50 different human actions. Nevertheless, the UV map representation we consider allows encapsulating many different cloth topologies, and at test time we can simulate garments even if we did not specifically train for them. A thorough evaluation demonstrates that PhysXNet delivers cloth deformations very close to those computed with the physical engine, opening the door to effective integration within deep learning pipelines.

Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images

Nov 02, 2021

Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo, Alberto Sanfeliu, Francesc Moreno-Noguer

Abstract: We address the problem of multi-person 3D body pose and shape estimation from a single image. While this problem can be addressed by applying single-person approaches multiple times for the same scene, recent works have shown the advantages of building upon deep architectures that simultaneously reason about all people in the scene in a holistic manner by enforcing, e.g., depth order constraints or minimizing interpenetration among reconstructed bodies. However, existing approaches are still unable to capture the size variability of people caused by the inherent body scale and depth ambiguity. In this work, we tackle this challenge by devising a novel optimization scheme that learns the appropriate body scale and relative camera pose, by enforcing the feet of all people to remain on the ground floor. A thorough evaluation on the MuPoTS-3D and 3DPW datasets demonstrates that our approach is able to robustly estimate the body translation and shape of multiple people while retrieving their spatial arrangement, consistently improving on the current state of the art, especially in scenes with people of very different heights.
