Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ignas Budvytis

HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation

May 11, 2023

Akash Sengupta, Ignas Budvytis, Roberto Cipolla

Abstract:Monocular 3D human pose and shape estimation is an ill-posed problem since multiple 3D solutions can explain a 2D image of a subject. Recent approaches predict a probability distribution over plausible 3D pose and shape parameters conditioned on the image. We show that these approaches exhibit a trade-off between three key properties: (i) accuracy - the likelihood of the ground-truth 3D solution under the predicted distribution, (ii) sample-input consistency - the extent to which 3D samples from the predicted distribution match the visible 2D image evidence, and (iii) sample diversity - the range of plausible 3D solutions modelled by the predicted distribution. Our method, HuManiFlow, predicts simultaneously accurate, consistent and diverse distributions. We use the human kinematic tree to factorise full body pose into ancestor-conditioned per-body-part pose distributions in an autoregressive manner. Per-body-part distributions are implemented using normalising flows that respect the manifold structure of SO(3), the Lie group of per-body-part poses. We show that ill-posed, but ubiquitous, 3D point estimate losses reduce sample diversity, and employ only probabilistic training losses. Code is available at: https://github.com/akashsengupta1997/HuManiFlow.

* CVPR 2023

Via

Access Paper or Ask Questions

IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Apr 28, 2023

Fei Xue, Ignas Budvytis, Roberto Cipolla

Figure 1 for IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Figure 2 for IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Figure 3 for IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Figure 4 for IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Abstract:Previous methods solve feature matching and pose estimation using a two-stage process by first finding matches and then estimating the pose. As they ignore the geometric relationships between the two tasks, they focus on either improving the quality of matches or filtering potential outliers, leading to limited efficiency or accuracy. In contrast, we propose an iterative matching and pose estimation framework (IMP) leveraging the geometric connections between the two tasks: a few good matches are enough for a roughly accurate pose estimation; a roughly accurate pose can be used to guide the matching by providing geometric constraints. To this end, we implement a geometry-aware recurrent attention-based module which jointly outputs sparse matches and camera poses. Specifically, for each iteration, we first implicitly embed geometric information into the module via a pose-consistency loss, allowing it to predict geometry-aware matches progressively. Second, we introduce an \textbf{e}fficient IMP, called EIMP, to dynamically discard keypoints without potential matches, avoiding redundant updating and significantly reducing the quadratic time complexity of attention computation in transformers. Experiments on YFCC100m, Scannet, and Aachen Day-Night datasets demonstrate that the proposed method outperforms previous approaches in terms of accuracy and efficiency.

* CVPR 2023. code available at https://github.com/feixue94/imp-release

Via

Access Paper or Ask Questions

SFD2: Semantic-guided Feature Detection and Description

Apr 28, 2023

Fei Xue, Ignas Budvytis, Roberto Cipolla

Figure 1 for SFD2: Semantic-guided Feature Detection and Description

Figure 2 for SFD2: Semantic-guided Feature Detection and Description

Figure 3 for SFD2: Semantic-guided Feature Detection and Description

Figure 4 for SFD2: Semantic-guided Feature Detection and Description

Abstract:Visual localization is a fundamental task for various applications including autonomous driving and robotics. Prior methods focus on extracting large amounts of often redundant locally reliable features, resulting in limited efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector is able to detect keypoints from reliable regions (e.g. building, traffic lane) and suppress unreliable areas (e.g. sky, car) implicitly instead of relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes and avoiding the need of additional segmentation networks at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. Particularly, experiments on long-term large-scale visual localization Aachen Day-Night and RobotCar-Seasons datasets demonstrate that our model outperforms previous local features and gives competitive accuracy to advanced matchers but is about 2 and 3 times faster when using 2k and 4k keypoints, respectively.

* CVPR 2023. code is available at https://github.com/feixue94/sfd2

Via

Access Paper or Ask Questions

A CNN Based Approach for the Point-Light Photometric Stereo Problem

Oct 10, 2022

Fotios Logothetis, Roberto Mecca, Ignas Budvytis, Roberto Cipolla

Abstract:Reconstructing the 3D shape of an object using several images under different light sources is a very challenging task, especially when realistic assumptions such as light propagation and attenuation, perspective viewing geometry and specular light reflection are considered. Many of works tackling Photometric Stereo (PS) problems often relax most of the aforementioned assumptions. Especially they ignore specular reflection and global illumination effects. In this work, we propose a CNN-based approach capable of handling these realistic assumptions by leveraging recent improvements of deep neural networks for far-field Photometric Stereo and adapt them to the point light setup. We achieve this by employing an iterative procedure of point-light PS for shape estimation which has two main steps. Firstly we train a per-pixel CNN to predict surface normals from reflectance samples. Secondly, we compute the depth by integrating the normal field in order to iteratively estimate light directions and attenuation which is used to compensate the input images to compute reflectance samples for the next iteration. Our approach sigificantly outperforms the state-of-the-art on the DiLiGenT real world dataset. Furthermore, in order to measure the performance of our approach for near-field point-light source PS data, we introduce LUCES the first real-world 'dataset for near-fieLd point light soUrCe photomEtric Stereo' of 14 objects of different materials were the effects of point light sources and perspective viewing are a lot more significant. Our approach also outperforms the competition on this dataset as well. Data and test code are available at the project page.

* arXiv admin note: text overlap with arXiv:2009.05792

Via

Access Paper or Ask Questions

IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty

Oct 07, 2022

Gwangbin Bae, Ignas Budvytis, Roberto Cipolla

Figure 1 for IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty

Figure 2 for IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty

Figure 3 for IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty

Figure 4 for IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty

Abstract:Single image surface normal estimation and depth estimation are closely related problems as the former can be calculated from the latter. However, the surface normals computed from the output of depth estimation methods are significantly less accurate than the surface normals directly estimated by networks. To reduce such discrepancy, we introduce a novel framework that uses surface normal and its uncertainty to recurrently refine the predicted depth-map. The depth of each pixel can be propagated to a query pixel, using the predicted surface normal as guidance. We thus formulate depth refinement as a classification of choosing the neighboring pixel to propagate from. Then, by propagating to sub-pixel points, we upsample the refined, low-resolution output. The proposed method shows state-of-the-art performance on NYUv2 and iBims-1 - both in terms of depth and normal. Our refinement module can also be attached to the existing depth estimation methods to improve their accuracy. We also show that our framework, only trained for depth estimation, can also be used for depth completion. The code is available at https://github.com/baegwangbin/IronDepth.

* BMVC 2022

Via

Access Paper or Ask Questions

SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Oct 03, 2022

Florian Langer, Gwangbin Bae, Ignas Budvytis, Roberto Cipolla

Figure 1 for SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Figure 2 for SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Figure 3 for SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Figure 4 for SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image

Abstract:Estimating 3D shapes and poses of static objects from a single image has important applications for robotics, augmented reality and digital content creation. Often this is done through direct mesh predictions which produces unrealistic, overly tessellated shapes or by formulating shape prediction as a retrieval task followed by CAD model alignment. Directly predicting CAD model poses from 2D image features is difficult and inaccurate. Some works, such as ROCA, regress normalised object coordinates and use those for computing poses. While this can produce more accurate pose estimates, predicting normalised object coordinates is susceptible to systematic failure. Leveraging efficient transformer architectures we demonstrate that a sparse, iterative, render-and-compare approach is more accurate and robust than relying on normalised object coordinates. For this we combine 2D image information including sparse depth and surface normal values which we estimate directly from the image with 3D CAD model information in early fusion. In particular, we reproject points sampled from the CAD model in an initial, random pose and compute their depth and surface normal values. This combined information is the input to a pose prediction network, SPARC-Net which we train to predict a 9 DoF CAD model pose update. The CAD model is reprojected again and the next pose update is predicted. Our alignment procedure converges after just 3 iterations, improving the state-of-the-art performance on the challenging real-world dataset ScanNet from 25.0% to 31.8% instance alignment accuracy. Code will be released at https://github.com/florianlanger/SPARC .

Via

Access Paper or Ask Questions

Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Dec 15, 2021

Gwangbin Bae, Ignas Budvytis, Roberto Cipolla

Figure 1 for Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Figure 2 for Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Figure 3 for Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Figure 4 for Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Abstract:Multi-view depth estimation methods typically require the computation of a multi-view cost-volume, which leads to huge memory consumption and slow inference. Furthermore, multi-view matching can fail for texture-less surfaces, reflective surfaces and moving objects. For such failure modes, single-view depth estimation methods are often more reliable. To this end, we propose MaGNet, a novel framework for fusing single-view depth probability with multi-view geometry, to improve the accuracy, robustness and efficiency of multi-view depth estimation. For each frame, MaGNet estimates a single-view depth probability distribution, parameterized as a pixel-wise Gaussian. The distribution estimated for the reference frame is then used to sample per-pixel depth candidates. Such probabilistic sampling enables the network to achieve higher accuracy while evaluating fewer depth candidates. We also propose depth consistency weighting for the multi-view matching score, to ensure that the multi-view depth is consistent with the single-view predictions. The proposed method achieves state-of-the-art performance on ScanNet, 7-Scenes and KITTI. Qualitative evaluation demonstrates that our method is more robust against challenging artifacts such as texture-less/reflective surfaces and moving objects.

Via

Access Paper or Ask Questions

Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

Nov 30, 2021

Akash Sengupta, Ignas Budvytis, Roberto Cipolla

Figure 1 for Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

Figure 2 for Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

Figure 3 for Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

Figure 4 for Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

Abstract:This paper addresses the problem of 3D human body shape and pose estimation from RGB images. Some recent approaches to this task predict probability distributions over human body model parameters conditioned on the input images. This is motivated by the ill-posed nature of the problem wherein multiple 3D reconstructions may match the image evidence, particularly when some parts of the body are locally occluded. However, body shape parameters in widely-used body models (e.g. SMPL) control global deformations over the whole body surface. Distributions over these global shape parameters are unable to meaningfully capture uncertainty in shape estimates associated with locally-occluded body parts. In contrast, we present a method that (i) predicts distributions over local body shape in the form of semantic body measurements and (ii) uses a linear mapping to transform a local distribution over body measurements to a global distribution over SMPL shape parameters. We show that our method outperforms the current state-of-the-art in terms of identity-dependent body shape estimation accuracy on the SSP-3D dataset, and a private dataset of tape-measured humans, by probabilistically-combining local body measurement distributions predicted from multiple images of a subject.

* BMVC 2021

Via

Access Paper or Ask Questions

Leveraging Geometry for Shape Estimation from a Single RGB Image

Nov 10, 2021

Florian Langer, Ignas Budvytis, Roberto Cipolla

Figure 1 for Leveraging Geometry for Shape Estimation from a Single RGB Image

Figure 2 for Leveraging Geometry for Shape Estimation from a Single RGB Image

Figure 3 for Leveraging Geometry for Shape Estimation from a Single RGB Image

Figure 4 for Leveraging Geometry for Shape Estimation from a Single RGB Image

Abstract:Predicting 3D shapes and poses of static objects from a single RGB image is an important research area in modern computer vision. Its applications range from augmented reality to robotics and digital content creation. Typically this task is performed through direct object shape and pose predictions which is inaccurate. A promising research direction ensures meaningful shape predictions by retrieving CAD models from large scale databases and aligning them to the objects observed in the image. However, existing work does not take the object geometry into account, leading to inaccurate object pose predictions, especially for unseen objects. In this work we demonstrate how cross-domain keypoint matches from an RGB image to a rendered CAD model allow for more precise object pose predictions compared to ones obtained through direct predictions. We further show that keypoint matches can not only be used to estimate the pose of an object, but also to modify the shape of the object itself. This is important as the accuracy that can be achieved with object retrieval alone is inherently limited to the available CAD models. Allowing shape adaptation bridges the gap between the retrieved CAD model and the observed shape. We demonstrate our approach on the challenging Pix3D dataset. The proposed geometric shape prediction improves the AP mesh over the state-of-the-art from 33.2 to 37.8 on seen objects and from 8.2 to 17.1 on unseen objects. Furthermore, we demonstrate more accurate shape predictions without closely matching CAD models when following the proposed shape adaptation. Code is publicly available at https://github.com/florianlanger/leveraging_geometry_for_shape_estimation .

Via

Access Paper or Ask Questions

Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild

Oct 03, 2021

Akash Sengupta, Ignas Budvytis, Roberto Cipolla

Figure 1 for Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild

Figure 2 for Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild

Figure 3 for Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild

Figure 4 for Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild

Abstract:This paper addresses the problem of 3D human body shape and pose estimation from an RGB image. This is often an ill-posed problem, since multiple plausible 3D bodies may match the visual evidence present in the input - particularly when the subject is occluded. Thus, it is desirable to estimate a distribution over 3D body shape and pose conditioned on the input image instead of a single 3D reconstruction. We train a deep neural network to estimate a hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices (i.e. body pose), which exploits the human body's kinematic tree structure, as well as a Gaussian distribution over SMPL body shape parameters. To further ensure that the predicted shape and pose distributions match the visual evidence in the input image, we implement a differentiable rejection sampler to impose a reprojection loss between ground-truth 2D joint coordinates and samples from the predicted distributions, projected onto the image plane. We show that our method is competitive with the state-of-the-art in terms of 3D shape and pose metrics on the SSP-3D and 3DPW datasets, while also yielding a structured probability distribution over 3D body shape and pose, with which we can meaningfully quantify prediction uncertainty and sample multiple plausible 3D reconstructions to explain a given input image. Code is available at https://github.com/akashsengupta1997/HierarchicalProbabilistic3DHuman .

* ICCV 2021

Via

Access Paper or Ask Questions