Bastian Leibe

MOTS: Multi-Object Tracking and Segmentation

Feb 10, 2019
Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe

This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS). Towards this goal, we create dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure. Our new annotations comprise 70,430 pixel masks for 1,084 distinct objects (cars and pedestrians) in 10,870 video frames. For evaluation, we extend existing multi-object tracking metrics to this new task. Moreover, we propose a new baseline method which jointly addresses detection, tracking, and segmentation with a single convolutional network. We demonstrate the value of our datasets by achieving improvements in performance when training on MOTS annotations. We believe that our datasets, metrics and baseline will become a valuable resource towards developing multi-object tracking approaches that go beyond 2D bounding boxes.

4D Generic Video Object Proposals

Jan 26, 2019
Aljosa Osep, Paul Voigtlaender, Mark Weber, Jonathon Luiten, Bastian Leibe

Many high-level video understanding methods require input in the form of object proposals. Currently, such proposals are predominantly generated with the help of networks that were trained for detecting and segmenting a set of known object classes, which limits their applicability to cases where all objects of interest are represented in the training set. This is a restriction for automotive scenarios, where unknown objects can frequently occur. We propose an approach that can reliably extract spatio-temporal object proposals for both known and unknown object categories from stereo video. Our 4D Generic Video Tubes (4D-GVT) method leverages motion cues, stereo data, and object instance segmentation to compute a compact set of video-object proposals that precisely localizes object candidates and their contours in 3D space and time. We show that given only a small amount of labeled data, our 4D-GVT proposal generator generalizes well to real-world scenarios, in which unknown categories appear. It outperforms other approaches that try to detect as many objects as possible by increasing the number of classes in the training set to several thousand.

* 16 pages (10 paper + 6 supplementary), 11 figures, 11 tables

Synthetic Occlusion Augmentation with Volumetric Heatmaps for the 2018 ECCV PoseTrack Challenge on 3D Human Pose Estimation

Nov 06, 2018
István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe

In this paper we present our winning entry at the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. Using a fully-convolutional backbone architecture, we obtain volumetric heatmaps per body joint, which we convert to coordinates using soft-argmax. Absolute person center depth is estimated by a 1D heatmap prediction head. The coordinates are back-projected to 3D camera space, where we minimize the L1 loss. Key to our good results is the training data augmentation with randomly placed occluders from the Pascal VOC dataset. In addition to reaching first place in the Challenge, our method also surpasses the state-of-the-art on the full Human3.6M benchmark among methods that use no additional pose datasets in training. Code for applying synthetic occlusions is available at https://github.com/isarandi/synthetic-occlusion.
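
The soft-argmax step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `soft_argmax_3d`, the nested-list layout `logits[z][y][x]`, and the use of voxel indices as coordinates are assumptions made for illustration. The idea is to apply a softmax over the whole volume and return the expected voxel position, which is differentiable, unlike a hard argmax.

```python
import math
from itertools import product


def soft_argmax_3d(logits):
    """Expected (x, y, z) voxel index under a softmax over a volume.

    `logits` is a nested list indexed as logits[z][y][x] (assumed layout).
    """
    depth, height, width = len(logits), len(logits[0]), len(logits[0][0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(v for plane in logits for row in plane for v in row)
    total = 0.0
    ex = ey = ez = 0.0
    for z, y, x in product(range(depth), range(height), range(width)):
        w = math.exp(logits[z][y][x] - peak)
        total += w
        ex += w * x
        ey += w * y
        ez += w * z
    return ex / total, ey / total, ez / total
```

For a volume with one strongly dominant voxel, the result lies very close to that voxel's index; for a uniform 2 x 2 x 2 volume it is the center, (0.5, 0.5, 0.5).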

* Extended abstract for the 2018 ECCV PoseTrack Workshop, updated with full result tables

PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation

Nov 03, 2018
Jonathon Luiten, Paul Voigtlaender, Bastian Leibe

We address semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the first-frame ground truth annotations. Towards this goal, we present the PReMVOS algorithm (Proposal-generation, Refinement and Merging for Video Object Segmentation). Our method separates this problem into two steps, first generating a set of accurate object segmentation mask proposals for each video frame and then selecting and merging these proposals into accurate and temporally consistent pixel-wise object tracks over a video sequence in a way which is designed to specifically tackle the difficult challenges involved with segmenting multiple objects across a video sequence. Our approach surpasses all previous state-of-the-art results on the DAVIS 2017 video object segmentation benchmark with a J & F mean score of 71.6 on the test-dev dataset, and achieves first place in both the DAVIS 2018 Video Object Segmentation Challenge and the YouTube-VOS 1st Large-scale Video Object Segmentation Challenge.

* Accepted for publication in ACCV18

Know What Your Neighbors Do: 3D Semantic Segmentation of Point Clouds

Oct 02, 2018
Francis Engelmann, Theodora Kontogianni, Jonas Schult, Bastian Leibe

In this paper, we present a deep learning architecture which addresses the problem of 3D semantic segmentation of unstructured point clouds. Compared to previous work, we introduce grouping techniques which define point neighborhoods in the initial world space and the learned feature space. Neighborhoods are important as they make it possible to compute local or global point features depending on the spatial extent of the neighborhood. Additionally, we incorporate dedicated loss functions to further structure the learned point feature space: the pairwise distance loss and the centroid loss. We show how to apply these mechanisms to the task of 3D semantic segmentation of point clouds and report state-of-the-art performance on indoor and outdoor datasets.

Combined Image- and World-Space Tracking in Traffic Scenes

Sep 19, 2018
Aljosa Osep, Wolfgang Mehner, Markus Mathias, Bastian Leibe

Tracking in urban street scenes plays a central role in autonomous systems such as self-driving cars. Most of the current vision-based tracking methods perform tracking in the image domain. Other approaches, e.g., those based on LIDAR and radar, track purely in 3D. While some vision-based tracking methods invoke 3D information in parts of their pipeline, and some 3D-based methods utilize image-based information in components of their approach, we propose to use image- and world-space information jointly throughout our method. We present our tracking pipeline as a 3D extension of image-based tracking. From enhancing the detections with 3D measurements to the reported positions of every tracked object, we use world-space 3D information at every stage of processing. We accomplish this by our novel coupled 2D-3D Kalman filter, combined with a conceptually clean and extendable hypothesize-and-select framework. Our approach matches the current state-of-the-art on the official KITTI benchmark, which performs evaluation in the 2D image domain only. Further experiments show significant improvements in 3D localization precision by enabling our coupled 2D-3D tracking.

* 8 pages, 7 figures, 2 tables. ICRA 2017 paper

Towards Large-Scale Video Video Object Mining

Sep 19, 2018
Aljosa Osep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, Bastian Leibe

We propose to leverage a generic object tracker in order to perform object mining in large-scale unlabeled videos, captured in a realistic automotive setting. We present a dataset of more than 360'000 automatically mined object tracks from 10+ hours of video data (560'000 frames) and propose a method for automated novel category discovery and detector learning. In addition, we show preliminary results on using the mined tracks for object detector adaptation.

* 4 pages, 3 figures, 1 table. ECCV 2018 Workshop on Interactive and Adaptive Learning in an Open World

How Robust is 3D Human Pose Estimation to Occlusion?

Aug 29, 2018
István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe

Occlusion is commonplace in realistic human-robot shared environments, yet its effects are not considered in standard 3D human pose estimation benchmarks. This leaves the question open: how robust are state-of-the-art 3D pose estimation methods against partial occlusions? We study several types of synthetic occlusions over the Human3.6M dataset and find a method with state-of-the-art benchmark performance to be sensitive even to low amounts of occlusion. Addressing this issue is key to progress in applications such as collaborative and service robotics. We take a first step in this direction by improving occlusion-robustness through training data augmentation with synthetic occlusions. This also turns out to be an effective regularizer that is beneficial even for non-occluded test cases.

* Accepted for IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'18) - Workshop on Robotic Co-workers 4.0: Human Safety and Comfort in Human-Robot Interactive Social Environments

Detection-Tracking for Efficient Person Analysis: The DetTA Pipeline

Jul 28, 2018
Stefan Breuers, Lucas Beyer, Umer Rafi, Bastian Leibe

In the past decade many robots were deployed in the wild, and people detection and tracking is an important component of such deployments. On top of that, one often needs to run modules which analyze persons and extract higher level attributes such as age and gender, or dynamic information like gaze and pose. The latter ones are especially necessary for building a reactive, social robot-person interaction. In this paper, we combine those components in a fully modular detection-tracking-analysis pipeline, called DetTA. We investigate the benefits of such an integration on the example of head and skeleton pose, by using the consistent track ID for a temporal filtering of the analysis modules' observations, showing a slight improvement in a challenging real-world scenario. We also study the potential of a so-called "free-flight" mode, where the analysis of a person attribute only relies on the filter's predictions for certain frames. Here, our study shows that this boosts the runtime dramatically, while the prediction quality remains stable. This insight is especially important for reducing power consumption and sharing precious (GPU-)memory when running many analysis components on a mobile platform, especially so in the era of expensive deep learning methods.
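
The "free-flight" idea described in the abstract can be sketched as follows. This is an illustration only, not the DetTA implementation: the function `track_attribute`, the callback `measure`, the parameters `every`, `alpha`, and `beta`, and the choice of a simple alpha-beta filter with a constant-rate model are all assumptions. The costly analysis module runs only on every `every`-th frame, and the filter's prediction is used on the frames in between.

```python
def track_attribute(measure, num_frames, every=3, alpha=0.5, beta=0.1):
    """Filter a scalar attribute over `num_frames` frames.

    `measure(frame)` stands in for an expensive analysis module
    (hypothetical callback); it is called only on every `every`-th frame.
    Returns the filtered values and the number of `measure` calls.
    """
    value, rate = measure(0), 0.0
    outputs, calls = [value], 1
    for frame in range(1, num_frames):
        value += rate  # predict with a constant-rate model, one frame ahead
        if frame % every == 0:
            # Update step: correct the prediction with a fresh observation.
            residual = measure(frame) - value
            value += alpha * residual
            rate += beta * residual
            calls += 1
        # On all other frames the prediction alone is reported (free-flight).
        outputs.append(value)
    return outputs, calls
```

With `every=3` and seven frames, the analysis callback runs on frames 0, 3, and 6 only, so roughly two thirds of the analysis cost is skipped in this toy setting.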

* Code available at: https://github.com/sbreuers/detta

Iteratively Trained Interactive Segmentation

May 11, 2018
Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe

Deep learning requires large amounts of training data to be effective. For the task of object segmentation, manually labeling data is very expensive, and hence interactive methods are needed. Following recent approaches, we develop an interactive object segmentation system which uses user input in the form of clicks as the input to a convolutional network. While previous methods use heuristic click sampling strategies to emulate user clicks during training, we propose a new iterative training strategy. During training, we iteratively add clicks based on the errors of the currently predicted segmentation. We show that our iterative training strategy, together with additional improvements to the network architecture, yields better results than the state of the art.
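
A minimal sketch of how a corrective click might be derived from the errors of the current prediction is given below. The abstract does not specify the selection rule, so everything here is an assumption: the function name `next_click`, the binary masks as nested lists, and the rule of picking the misclassified pixel closest to the centroid of all misclassified pixels. A click is positive when the pixel belongs to the object but was missed, and negative when background was labeled as object.

```python
def next_click(gt, pred):
    """Return (row, col, is_positive) for a simulated click, or None.

    `gt` and `pred` are binary masks as nested lists of 0/1 with equal shape.
    The selection rule (error pixel nearest the error centroid) is an
    illustrative assumption, not the rule used in the paper.
    """
    errors = [
        (r, c)
        for r, row in enumerate(gt)
        for c, g in enumerate(row)
        if g != pred[r][c]
    ]
    if not errors:
        return None  # prediction already matches the ground truth
    center_r = sum(r for r, _ in errors) / len(errors)
    center_c = sum(c for _, c in errors) / len(errors)
    r, c = min(errors, key=lambda p: (p[0] - center_r) ** 2 + (p[1] - center_c) ** 2)
    # Positive click on a missed object pixel, negative on a false positive.
    return r, c, bool(gt[r][c])
```

In a training loop, the returned click would be added to the click maps fed to the network, and the prediction recomputed before sampling the next click.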
