Juergen Gall

3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

Aug 12, 2023
Shuxiao Ding, Eike Rehder, Lukas Schneider, Marius Cordts, Juergen Gall

Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as the Kalman filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to an inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.

* 17 pages, 8 figures, accepted by ICCV2023
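
To make the idea of data association via edge classification more concrete, here is a minimal sketch of how per-edge scores on a track-detection bipartite graph could be turned into one-to-one matches. It is not taken from the 3DMOTFormer code: the function name `associate`, the greedy matching rule, and the 0.5 threshold are all assumptions chosen for illustration, and the scores are made up.

```python
def associate(edge_scores, threshold=0.5):
    """Greedy one-to-one matching on a track-detection bipartite graph.

    edge_scores maps (track_id, detection_id) to a score in [0, 1], as an
    edge classifier might produce. Greedy matching and the threshold are
    assumptions for this sketch, not the paper's procedure.
    """
    matches = {}
    used_detections = set()
    # Visit edges from the highest score to the lowest.
    ranked = sorted(edge_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (track, detection), score in ranked:
        if score < threshold:
            break  # all remaining edges are below the threshold
        if track in matches or detection in used_detections:
            continue  # each track and each detection is matched at most once
        matches[track] = detection
        used_detections.add(detection)
    return matches


# Made-up scores for two tracks and two detections.
scores = {
    ("t1", "d1"): 0.9,
    ("t1", "d2"): 0.2,
    ("t2", "d1"): 0.6,
    ("t2", "d2"): 0.7,
}
matches = associate(scores)
```

In this toy example, tracks or detections left without a match would have to be handled separately, for instance by terminating a track or starting a new one.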

Action Anticipation with Goal Consistency

Jun 26, 2023
Olga Zatsarynna, Juergen Gall

In this paper, we address the problem of short-term action anticipation, i.e., we want to predict an upcoming action one second before it happens. We propose to harness high-level intent information to anticipate actions that will take place in the future. To this end, we incorporate an additional goal prediction branch into our model and propose a consistency loss function that encourages the anticipated actions to conform to the high-level goal pursued in the video. In our experiments, we show the effectiveness of the proposed approach and demonstrate that our method achieves state-of-the-art results on two large-scale datasets: Assembly101 and COIN.

* Accepted to ICIP 2023

PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird's-Eye View

Jun 19, 2023
Peizheng Li, Shuxiao Ding, Xieyuanli Chen, Niklas Hanselmann, Marius Cordts, Juergen Gall

Accurately perceiving instances and predicting their future motion are key tasks for autonomous vehicles, enabling them to navigate safely in complex urban traffic. While bird's-eye view (BEV) representations are commonplace in perception for autonomous driving, their potential in a motion prediction setting is less explored. Existing approaches for BEV instance prediction from surround cameras rely on a multi-task auto-regressive setup coupled with complex post-processing to predict future instances in a spatio-temporally consistent manner. In this paper, we depart from this paradigm and propose an efficient novel end-to-end framework named POWERBEV, which differs in several design choices aimed at reducing the inherent redundancy in previous methods. First, rather than predicting the future in an auto-regressive fashion, POWERBEV uses a parallel, multi-scale module built from lightweight 2D convolutional networks. Second, we show that segmentation and centripetal backward flow are sufficient for prediction, simplifying previous multi-task objectives by eliminating redundant output modalities. Building on this output representation, we propose a simple, flow warping-based post-processing approach which produces more stable instance associations across time. Through this lightweight yet powerful design, POWERBEV outperforms state-of-the-art baselines on the NuScenes Dataset and poses an alternative paradigm for BEV instance prediction. We made our code publicly available at: https://github.com/EdwardLeeLPZ/PowerBEV.

* 12 pages, 8 figures. This paper is accepted by IJCAI2023. Peizheng Li and Shuxiao Ding contributed equally to this work
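
The flow warping-based association mentioned in the abstract can be illustrated with a small sketch. This is a simplified illustration, not the PowerBEV implementation: it assumes that the backward flow is given as integer `(dy, dx)` offsets pointing from each foreground cell at time t to a cell at time t-1, and that the instance ID is simply copied from that cell. The function name and the data layout are hypothetical.

```python
def warp_instance_ids(prev_ids, foreground, backward_flow):
    """Propagate instance IDs from frame t-1 to frame t via backward flow.

    prev_ids:      2D list of int instance IDs at t-1 (0 means background)
    foreground:    2D list of bool, the segmentation mask at t
    backward_flow: 2D list of (dy, dx) integer offsets from a cell at t
                   to its assumed source cell at t-1
    """
    height, width = len(foreground), len(foreground[0])
    ids = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not foreground[y][x]:
                continue  # background cells keep ID 0
            dy, dx = backward_flow[y][x]
            src_y, src_x = y + dy, x + dx
            # Copy the ID only if the source cell lies inside the grid.
            if 0 <= src_y < height and 0 <= src_x < width:
                ids[y][x] = prev_ids[src_y][src_x]
    return ids


# A 1x4 grid in which instance 7 moves one cell to the right.
prev_ids = [[7, 7, 0, 0]]
foreground = [[False, True, True, False]]
backward_flow = [[(0, 0), (0, -1), (0, -1), (0, 0)]]
current_ids = warp_instance_ids(prev_ids, foreground, backward_flow)
```

Because every foreground cell looks up its ID in the previous frame, an instance keeps the same ID over time as long as the flow points back into it.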

A Dual-Source Attention Transformer for Multi-Person Pose Tracking

Jun 09, 2023
Andreas Doering, Juergen Gall

Multi-person pose tracking is an important element for many applications and requires estimating the human poses of all persons in a video and tracking them over time. The association of poses across frames remains an open research problem, in particular for online tracking methods, due to motion blur, crowded scenes and occlusions. To tackle the association challenge, we propose a Dual-Source Attention Transformer that incorporates three core aspects: i) In order to re-identify persons that have been occluded, we propose a pose-conditioned re-identification network that provides an initial embedding and allows persons to be matched even if the number of visible joints differs between the frames. ii) We incorporate edge embeddings based on temporal pose similarity, and the impact of appearance and pose similarity is automatically adapted. iii) We propose an attention-based matching layer for pose-to-track association and duplicate removal. We evaluate our approach on Market1501, PoseTrack 2018 and PoseTrack21.

Location-aware Adaptive Denormalization: A Deep Learning Approach For Wildfire Danger Forecasting

Dec 16, 2022
Mohamad Hakam Shams Eddin, Ribana Roscher, Juergen Gall

Climate change is expected to intensify and increase extreme events in the weather cycle. Since this has a significant impact on various sectors of our life, recent works are concerned with identifying and predicting such extreme events from Earth observations. This paper proposes a 2D/3D two-branch convolutional neural network (CNN) for wildfire danger forecasting. To use a unified framework, previous approaches duplicate static variables along the time dimension and neglect the intrinsic differences between static and dynamic variables. Furthermore, most existing multi-branch architectures lose the interconnections between the branches during the feature learning stage. To address these issues, we propose a two-branch architecture with a Location-aware Adaptive Denormalization layer (LOADE). Using LOADE as a building block, we can modulate the dynamic features conditional on their geographical location. Thus, our approach considers feature properties as a unified yet compound 2D/3D model. Besides, we propose using an absolute temporal encoding for time-related forecasting problems. Our experimental results show that our approach performs better than the baselines on the challenging FireCube dataset.
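
As a rough illustration of what modulating dynamic features conditional on location can mean, the sketch below follows the general pattern of adaptive denormalization layers: the features are normalized and then rescaled and shifted by location-dependent parameters. This is an assumption about the general form only, not the LOADE layer itself. In a real model the `gamma` and `beta` values would be predicted by a network from static location features; here they are plain lists, and all names are hypothetical.

```python
import math


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def adaptive_denormalize(dynamic_features, gamma, beta):
    """Scale and shift normalized dynamic features per location.

    gamma and beta stand in for location-conditioned parameters; their
    source (a learned network in practice) is outside this sketch.
    """
    normed = normalize(dynamic_features)
    return [g * n + b for n, g, b in zip(normed, gamma, beta)]


# Three locations with made-up dynamic feature values and parameters.
features = [1.0, 2.0, 3.0]
gamma = [2.0, 2.0, 2.0]
beta = [1.0, 1.0, 1.0]
modulated = adaptive_denormalize(features, gamma, beta)
```

The design point is that static information enters through `gamma` and `beta` rather than being duplicated along the time dimension of the dynamic input.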

Robust Action Segmentation from Timestamp Supervision

Oct 12, 2022
Yaser Souri, Yazan Abu Farha, Emad Bahrami, Gianpiero Francesca, Juergen Gall

Action segmentation is the task of predicting an action label for each frame of an untrimmed video. As obtaining annotations to train an approach for action segmentation in a fully supervised way is expensive, various approaches have been proposed to train action segmentation models using different forms of weak supervision, e.g., action transcripts, action sets, or more recently timestamps. Timestamp supervision is a promising type of weak supervision as obtaining one timestamp per action is less expensive than annotating all frames, but it provides more information than other forms of weak supervision. However, previous works assume that every action instance is annotated with a timestamp, which is a restrictive assumption since it assumes that annotators do not miss any action. In this work, we relax this restrictive assumption and take missing annotations for some action instances into account. We show that our approach is more robust to missing annotations compared to other approaches and various baselines.

* BMVC 2022

Dual Pyramid Generative Adversarial Networks for Semantic Image Synthesis

Oct 08, 2022
Shijie Li, Ming-Ming Cheng, Juergen Gall

The goal of semantic image synthesis is to generate photo-realistic images from semantic label maps. It is highly relevant for tasks like content generation and image editing. Current state-of-the-art approaches, however, still struggle to generate realistic objects in images at various scales. In particular, small objects tend to fade away and large objects are often generated as collages of patches. In order to address this issue, we propose a Dual Pyramid Generative Adversarial Network (DP-GAN) that learns the conditioning of spatially-adaptive normalization blocks at all scales jointly, such that scale information is bi-directionally used, and it unifies supervision at different scales. Our qualitative and quantitative results show that the proposed approach generates images where small and large objects look more realistic compared to images generated by state-of-the-art methods.

* BMVC 2022

Self-supervised Learning for Unintentional Action Prediction

Sep 24, 2022
Olga Zatsarynna, Yazan Abu Farha, Juergen Gall

Distinguishing whether an action is performed as intended or whether an intended action fails is an important skill that not only humans have, but that is also important for intelligent systems that operate in human environments. Recognizing whether an action is unintentional or anticipating whether an action will fail, however, is not straightforward due to the lack of annotated data. While videos of unintentional or failed actions can be found on the Internet in abundance, high annotation costs are a major bottleneck for learning networks for these tasks. In this work, we thus study the problem of self-supervised representation learning for unintentional action prediction. While previous works learn the representation based on a local temporal neighborhood, we show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization and anticipation. In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.

* Accepted to GCPR 2022

One-Shot Synthesis of Images and Segmentation Masks

Sep 15, 2022
Vadim Sushko, Dan Zhang, Juergen Gall, Anna Khoreva

Joint synthesis of images and segmentation masks with generative adversarial networks (GANs) promises to reduce the effort needed for collecting image data with pixel-wise annotations. However, to learn high-fidelity image-mask synthesis, existing GAN approaches first need a pre-training phase requiring large amounts of image data, which limits their utilization in restricted image domains. In this work, we take a step to reduce this limitation, introducing the task of one-shot image-mask synthesis. We aim to generate diverse images and their segmentation masks given only a single labelled example, and assuming, contrary to previous models, no access to any pre-training data. To this end, inspired by the recent architectural developments of single-image GANs, we introduce our OSMIS model, which enables the synthesis of segmentation masks that are precisely aligned to the generated images in the one-shot regime. Besides achieving high fidelity of the generated masks, OSMIS outperforms state-of-the-art single-image GAN models in image synthesis quality and diversity. In addition, despite not using any additional data, OSMIS demonstrates an impressive ability to serve as a source of useful data augmentation for one-shot segmentation applications, providing performance gains that are complementary to standard data augmentation techniques. Code is available at https://github.com/boschresearch/one-shot-synthesis

* Accepted as a conference paper at IEEE Winter Conference on Applications of Computer Vision (WACV) 2023

Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation

Sep 01, 2022
Nadine Behrmann, S. Alireza Golestaneh, Zico Kolter, Juergen Gall, Mehdi Noroozi

This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation in a fully and timestamp supervised setup. In contrast to current state-of-the-art frame-level prediction methods, we view action segmentation as a seq2seq translation task, i.e., mapping a sequence of video frames to a sequence of action segments. Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model to cope with long input sequences as opposed to short output sequences and relatively few videos. We incorporate an auxiliary supervision signal for the encoder via a frame-wise loss and propose a separate alignment decoder for an implicit duration prediction. Finally, we extend our framework to the timestamp supervised setting via our proposed constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed framework performs consistently on both fully and timestamp supervised settings, outperforming or competing with the state of the art on several datasets.

* ECCV 2022 (Main Conference)
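
The difference between frame-level prediction and a sequence of action segments can be shown with a short sketch. It assumes that a segment is represented as a `(label, duration_in_frames)` pair, which is a choice made here for illustration and not necessarily the representation used in the paper. The function names and the example labels are hypothetical.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, duration) segments."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]


def segments_to_frames(segments):
    """Expand (label, duration) segments back into per-frame labels."""
    return [label for label, duration in segments for _ in range(duration)]


# Six frames with made-up action labels.
frames = ["pour", "pour", "stir", "stir", "stir", "pour"]
segments = frames_to_segments(frames)
restored = segments_to_frames(segments)
```

The segment sequence is much shorter than the frame sequence, which reflects the long-input, short-output setting described in the abstract.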
