Martin R. Oswald

Automatic registration with continuous pose updates for marker-less surgical navigation in spine surgery

Aug 05, 2023
Florentin Liebmann, Marco von Atzigen, Dominik Stütz, Julian Wolf, Lukas Zingg, Daniel Suter, Laura Leoty, Hooman Esfandiari, Jess G. Snedeker, Martin R. Oswald, Marc Pollefeys, Mazda Farshad, Philipp Fürnstahl

Established surgical navigation systems for pedicle screw placement have been proven to be accurate, but still reveal limitations in registration or surgical guidance. Registration of preoperative data to the intraoperative anatomy remains a time-consuming, error-prone task that includes exposure to harmful radiation. Surgical guidance through conventional displays has well-known drawbacks, as information cannot be presented in-situ and from the surgeon's perspective. Consequently, radiation-free and more automatic registration methods with subsequent surgeon-centric navigation feedback are desirable. In this work, we present an approach that automatically solves the registration problem for lumbar spinal fusion surgery in a radiation-free manner. A deep neural network was trained to segment the lumbar spine and simultaneously predict its orientation, yielding an initial pose for preoperative models, which is then refined for each vertebra individually and updated in real time with GPU acceleration while handling surgeon occlusions. Intuitive surgical guidance is provided through integration into an augmented-reality-based navigation system. The registration method was verified on a public dataset with a mean of 96% successful registrations, a target registration error of 2.73 mm, a screw trajectory error of 1.79° and a screw entry point error of 2.43 mm. Additionally, the whole pipeline was validated in an ex-vivo surgery, yielding a 100% screw accuracy and a registration accuracy of 1.20 mm. Our results meet clinical demands and emphasize the potential of RGB-D data for fully automatic registration approaches in combination with augmented reality guidance.
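
To make the per-vertebra refinement idea concrete, here is a minimal sketch of a single ICP-style alignment step between a preoperative vertebra model and intraoperative surface points; the function name, the nearest-neighbour correspondence search, and the closed-form Kabsch update are illustrative assumptions rather than the paper's exact method.

    import numpy as np
    from scipy.spatial import cKDTree

    def refine_vertebra_pose(model_pts, surface_pts):
        """One ICP-style step: rigid (R, t) moving the vertebra model towards the observed surface."""
        # Nearest-neighbour correspondences from model points to intraoperative surface points.
        _, idx = cKDTree(surface_pts).query(model_pts)
        target = surface_pts[idx]
        # Closed-form Kabsch alignment of the matched point sets.
        mu_m, mu_t = model_pts.mean(0), target.mean(0)
        H = (model_pts - mu_m).T @ (target - mu_t)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_t - R @ mu_m
        return R, t

In a real-time setting such a step would be iterated per vertebra on the GPU, with occluded surface points filtered out beforehand.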

The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes

Jun 29, 2023
David Recasens, Martin R. Oswald, Marc Pollefeys, Javier Civera

Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume that static scene parts are observed alongside the deforming ones in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pipelines, which tackle the most challenging scenario of exploratory trajectories, suffer from a lack of robustness and proper quantitative evaluation methodologies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality. We further present a novel deformable odometry method, dubbed the Drunkard's Odometry, which decomposes optical flow estimates into rigid-body camera motion and non-rigid scene deformations. To validate our data, our work includes an evaluation of several baselines as well as a novel tracking error metric that does not require ground truth data. Dataset and code: https://davidrecasens.github.io/TheDrunkard'sOdometry/
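
The central decomposition idea can be sketched as follows: the flow explained by a rigid camera motion over the depth map is computed and subtracted from the total optical flow, leaving the non-rigid deformation as the residual. Function names and the pinhole-projection details below are assumptions for illustration, not the Drunkard's Odometry architecture.

    import numpy as np

    def rigid_flow(depth, K, R, t):
        """Flow induced by camera motion (R, t) on a static scene with depth map of shape (H, W)."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T   # 3 x N homogeneous pixels
        pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)            # back-project to 3D
        pts2 = R @ pts + t[:, None]                                    # apply the camera motion
        proj = K @ pts2
        proj = proj[:2] / proj[2:]                                     # re-project to pixel coordinates
        return (proj - pix[:2]).T.reshape(H, W, 2)                     # per-pixel displacement

    def decompose(total_flow, depth, K, R, t):
        """Non-rigid deformation = observed flow minus the rigid (camera-induced) component."""
        return total_flow - rigid_flow(depth, K, R, t)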

UncLe-SLAM: Uncertainty Learning for Dense Neural SLAM

Jun 19, 2023
Erik Sandström, Kevin Ta, Luc Van Gool, Martin R. Oswald

We present an uncertainty learning framework for dense neural simultaneous localization and mapping (SLAM). Estimating pixel-wise uncertainties for the depth input of dense SLAM methods allows re-weighting the tracking and mapping losses towards image regions that carry more reliable information for SLAM. To this end, we propose an online framework for sensor uncertainty estimation that can be trained in a self-supervised manner from only 2D input data. We further discuss the advantages of uncertainty learning for the case of multi-sensor input. Extensive analysis, experimentation, and ablations show that our proposed modeling paradigm improves both mapping and tracking accuracy and often performs better than alternatives that require ground truth depth or 3D. Our experiments show that we achieve a 38% and 27% lower absolute trajectory tracking error (ATE) on the 7-Scenes and TUM-RGBD datasets, respectively. On the popular Replica dataset, with two types of depth sensors, we report an 11% F1-score improvement for RGBD SLAM compared to recent state-of-the-art neural implicit approaches. Our source code will be made available.
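
A minimal sketch of the re-weighting idea, using a generic heteroscedastic loss in PyTorch; the exact loss used in UncLe-SLAM may differ, and the tensor names are placeholders.

    import torch

    def weighted_depth_loss(pred_depth, obs_depth, log_var):
        """Down-weight per-pixel depth residuals where the predicted uncertainty is high;
        the additive log-variance term keeps the model from declaring everything uncertain."""
        residual = (pred_depth - obs_depth) ** 2
        return (residual * torch.exp(-log_var) + log_var).mean()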

* 12 pages, 5 figures, 5 tables 

R-MAE: Regions Meet Masked Autoencoders

Jun 08, 2023
Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen

Vision-specific concepts such as "region" have played a key role in extending general machine learning frameworks to tasks like object detection. Given the success of region-based detectors for supervised learning and the progress of intra-image methods for contrastive learning, we explore the use of regions for reconstructive pre-training. Starting from Masked Autoencoding (MAE) both as a baseline and an inspiration, we propose a parallel pretext task tailored to address the one-to-many mapping between images and regions. Since such regions can be generated in an unsupervised way, our approach (R-MAE) inherits the wide applicability of MAE, while being more "region-aware". We conduct thorough analyses during the development of R-MAE, and converge on a variant that is both effective and efficient (1.3% overhead over MAE). Moreover, it shows consistent quantitative improvements when generalized to various pre-training data and downstream detection and segmentation benchmarks. Finally, we provide extensive qualitative visualizations to enhance the understanding of R-MAE's behaviour and potential. Code will be made available at https://github.com/facebookresearch/r-mae.
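
For readers unfamiliar with the MAE baseline that R-MAE builds on, its random patch-masking step looks roughly like the sketch below; the tensor names are hypothetical and the region-specific pretext head of R-MAE is omitted.

    import torch

    def mask_patches(patch_tokens, mask_ratio=0.75):
        """Keep a random subset of patch tokens, as in masked autoencoding."""
        B, N, D = patch_tokens.shape
        keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N)
        ids = noise.argsort(dim=1)[:, :keep]   # indices of the visible patches
        visible = torch.gather(patch_tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))
        return visible, ids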

Learning-based Relational Object Matching Across Views

May 03, 2023
Cathrin Elich, Iro Armeni, Martin R. Oswald, Marc Pollefeys, Joerg Stueckler

Intelligent robots require object-level scene understanding to reason about possible tasks and interactions with the environment. Moreover, many perception tasks such as scene reconstruction, image retrieval, or place recognition can benefit from reasoning on the level of objects. While keypoint-based matching can yield strong results for finding correspondences between images with small to medium viewpoint changes, for large viewpoint changes matching semantically on the object level becomes advantageous. In this paper, we propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images. We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network. We demonstrate our approach on realistically rendered synthetic images across a large variety of views. Our approach compares favorably to previous state-of-the-art object-level matching approaches and achieves improved performance over a pure keypoint-based approach for large viewpoint changes.
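
As an illustration of the final association step, the sketch below matches object detections across two views from already-computed embeddings via a cosine-similarity assignment; the embeddings are assumed given, and the graph neural network that produces them in the paper is omitted.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_objects(emb_a, emb_b, min_sim=0.5):
        """Cosine-similarity assignment between detections of view A and view B."""
        a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
        b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
        sim = a @ b.T
        rows, cols = linear_sum_assignment(-sim)   # maximize the total similarity
        return [(i, j) for i, j in zip(rows, cols) if sim[i, j] >= min_sim]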

* Accepted for publication in IEEE International Conference on Robotics and Automation (ICRA), 2023 

Tracking by 3D Model Estimation of Unknown Objects in Videos

Apr 13, 2023
Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald

Most model-free visual object tracking methods formulate the tracking task as object location estimation given by a 2D segmentation or a bounding box in each video frame. We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame. Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames, including frames where some points are invisible. To achieve that, the estimation is driven by re-rendering the input video frames as well as possible through differentiable rendering, which has not been used for tracking before. The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve the state-of-the-art in 2D segmentation tracking on three different datasets with mostly rigid objects.
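
Schematically, the optimisation is an analysis-by-synthesis loop like the one below; render stands for any differentiable renderer supplied by the caller, and the parameterisation and photometric loss are simplified placeholders rather than the paper's full objective.

    import torch

    def fit_object(render, frames, shape, texture, poses, iters=200, lr=1e-2):
        """Refine shape, texture and per-frame 6DoF poses by minimising the photometric
        error between re-rendered and observed frames (all leaf tensors with requires_grad=True)."""
        opt = torch.optim.Adam([shape, texture, poses], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            rendered = torch.stack([render(shape, texture, p) for p in poses])
            loss = (rendered - frames).abs().mean()   # photometric re-rendering loss
            loss.backward()
            opt.step()
        return shape, texture, poses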

Point-SLAM: Dense Neural Point Cloud-based SLAM

Apr 09, 2023
Erik Sandström, Yue Li, Luc Van Gool, Martin R. Oswald

We propose a dense neural simultaneous localization and mapping (SLAM) approach for monocular RGBD input which anchors the features of a neural scene representation in a point cloud that is iteratively generated in an input-dependent, data-driven manner. We demonstrate that both tracking and mapping can be performed with the same point-based neural scene representation by minimizing an RGBD-based re-rendering loss. In contrast to recent dense neural SLAM methods which anchor the scene features in a sparse grid, our point-based approach allows dynamically adapting the anchor point density to the information density of the input. This strategy reduces runtime and memory usage in regions with fewer details and dedicates higher point density to resolve fine details. Our approach performs better than or competitively with existing dense neural RGBD SLAM methods in tracking, mapping and rendering accuracy on the Replica, TUM-RGBD and ScanNet datasets. The source code is available at https://github.com/tfy14esa/Point-SLAM.
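
A minimal sketch of the input-adaptive density idea: choose a small anchor search radius (denser points) where the image carries fine detail, measured here simply by local gradient magnitude; the threshold and radii are illustrative values, not Point-SLAM's actual parameters.

    import numpy as np

    def anchor_search_radius(gray, r_fine=0.02, r_coarse=0.08, grad_thresh=0.1):
        """Per-pixel anchor radius: small (dense anchors) in detailed regions, large otherwise."""
        gy, gx = np.gradient(gray.astype(np.float32))
        grad = np.sqrt(gx ** 2 + gy ** 2)
        return np.where(grad > grad_thresh, r_fine, r_coarse)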

* 17 pages, 10 figures 

Human from Blur: Human Pose Tracking from Blurry Images

Mar 30, 2023
Yiming Zhao, Denys Rozumnyi, Jie Song, Otmar Hilliges, Marc Pollefeys, Martin R. Oswald

We propose a method to estimate 3D human poses from substantially blurred images. The key idea is to tackle the inverse problem of image deblurring by modeling the forward problem with a 3D human model, a texture map, and a sequence of poses to describe human motion. The blurring process is then modeled by a temporal image aggregation step. Using a differentiable renderer, we can solve the inverse problem by backpropagating the pixel-wise reprojection error to recover the best human motion representation that explains a single or multiple input images. Since the image reconstruction loss alone is insufficient, we present additional regularization terms. To the best of our knowledge, we present the first method to tackle this problem. Our method consistently outperforms other methods on significantly blurry inputs since they lack one or multiple key functionalities that our method unifies, i.e. image deblurring with sub-frame accuracy and explicit 3D modeling of non-rigid human motion.
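
The blur formation model can be sketched as a temporal average of sharp renderings over the exposure window; render_sharp and pose_at are hypothetical callables standing in for the differentiable renderer and the pose trajectory.

    import torch

    def render_blurred(render_sharp, shape, texture, pose_at, t0, t1, steps=8):
        """Approximate a blurred frame as the average of sharp renderings at sub-frame times."""
        ts = torch.linspace(t0, t1, steps)
        frames = [render_sharp(shape, texture, pose_at(t)) for t in ts]
        return torch.stack(frames).mean(dim=0)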

NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM

Feb 07, 2023
Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, Marc Pollefeys

Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM. However, previous works in this direction either rely on RGB-D sensors, or require a separate monocular SLAM approach for camera tracking and do not produce high-fidelity dense 3D scene reconstruction. In this paper, we present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis. To facilitate the optimization process for mapping, we integrate additional supervision signals including easy-to-obtain monocular geometric cues and optical flow, and also introduce a simple warping loss to further enforce geometry consistency. Moreover, to further boost performance in complicated indoor scenes, we also propose a locally adaptive transformation from signed distance functions (SDFs) to density in the volume rendering equation. On both synthetic and real-world datasets we demonstrate strong performance in dense mapping, tracking, and novel view synthesis, even competitive with recent RGB-D SLAM systems.
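
A hedged sketch of an SDF-to-density transform with a spatially varying sharpness parameter, in the spirit of Laplace-CDF style formulations; NICER-SLAM's exact local adaptation may differ.

    import torch

    def sdf_to_density(sdf, beta):
        """Map signed distances to volume densities; beta may vary per sample,
        so well-observed regions can use a smaller beta for a sharper surface."""
        alpha = 1.0 / beta
        return alpha * torch.where(
            sdf >= 0,
            0.5 * torch.exp(-sdf / beta),        # outside the surface: density decays towards zero
            1.0 - 0.5 * torch.exp(sdf / beta),   # inside: density saturates at alpha
        )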

* Video: https://youtu.be/tUXzqEZWg2w 

Detecting Objects with Graph Priors and Graph Refinement

Dec 23, 2022
Aritra Bhowmik, Martin R. Oswald, Yu Wang, Nora Baka, Cees G. M. Snoek

The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy-based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate that our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as over state-of-the-art methods modeling object interrelationships.
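
To make the co-occurrence prior concrete, a rough sketch: count which classes appear together in training images, normalise, and use the result to nudge each object's class scores. This simple blending rule is an illustrative stand-in for the paper's energy-based refinement, and all names below are hypothetical.

    import numpy as np

    def cooccurrence_prior(image_labels, num_classes):
        """image_labels: list of per-image class-id lists from the training annotations."""
        C = np.zeros((num_classes, num_classes))
        for labels in image_labels:
            present = np.unique(labels)
            for i in present:
                for j in present:
                    if i != j:
                        C[i, j] += 1
        return C / np.maximum(C.sum(axis=1, keepdims=True), 1)   # row-normalised prior

    def refine_scores(scores, prior, weight=0.1):
        """Blend each object's class scores with those suggested by co-detected objects."""
        context = scores @ prior                                  # expected co-occurring classes
        return (1 - weight) * scores + weight * context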

* 13 pages, 8 figures 