Didier Stricker

JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

Nov 30, 2023

Shishir Muralidhara, Sravan Kumar Jagadeesh, René Schuster, Didier Stricker

Abstract: Part-aware panoptic segmentation is a computer vision problem that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF) that combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this: First, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning. Second, the combination must be balanced so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets.

* Accepted for Springer Nature Computer Science. arXiv admin note: substantial text overlap with arXiv:2212.07671

ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map

Oct 18, 2023

Ahmed Tawfik Aboukhadra, Jameel Malik, Nadia Robertini, Ahmed Elhayek, Didier Stricker

Abstract: 3D reconstruction of hand-object manipulations is important for emulating human actions. Most methods dealing with challenging object manipulation scenarios focus on hand reconstruction in isolation, ignoring physical and kinematic constraints due to object contact. Some approaches produce more realistic results by jointly reconstructing 3D hand-object interactions. However, they focus on coarse pose estimation or rely upon known hand and object shapes. We propose the first approach for realistic 3D hand-object shape and pose reconstruction from a single depth map. Unlike previous work, our voxel-based reconstruction network regresses the vertex coordinates of a hand and an object and reconstructs more realistic interaction. Our pipeline additionally predicts voxelized hand-object shapes, having a one-to-one mapping to the input voxelized depth. Thereafter, we exploit the graph nature of the hand and object shapes by utilizing the recent GraFormer network with positional embedding to reconstruct shapes from template meshes. In addition, we show the impact of adding another GraFormer component that refines the reconstructed shapes based on the hand-object interactions and its ability to reconstruct more accurate object shapes. We perform an extensive evaluation on the HO-3D and DexYCB datasets and show that our method outperforms existing approaches in hand reconstruction and produces plausible reconstructions for the objects.

Cross-Dataset Experimental Study of Radar-Camera Fusion in Bird's-Eye View

Sep 27, 2023

Lukas Stäcker, Philipp Heidenreich, Jason Rambach, Didier Stricker

Abstract: By exploiting complementary sensor information, radar and camera fusion systems have the potential to provide a highly robust and reliable perception system for advanced driver assistance systems and automated driving functions. Recent advances in camera-based object detection offer new radar-camera fusion possibilities with bird's eye view feature maps. In this work, we propose a novel and flexible fusion network and evaluate its performance on two datasets: nuScenes and View-of-Delft. Our experiments reveal that while the camera branch needs large and diverse training data, the radar branch benefits more from a high-performance radar. Using transfer learning, we improve the camera's performance on the smaller dataset. Our results further demonstrate that the radar-camera fusion approach significantly outperforms the camera-only and radar-only baselines.

* EUSIPCO 2023

Introducing Language Guidance in Prompt-based Continual Learning

Aug 30, 2023

Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

Abstract: Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive when the number of tasks becomes large or data cannot be stored for privacy reasons. As an alternative, prompt-based methods have been proposed that store the task information in a learnable prompt pool. This prompt pool instructs a frozen image encoder on how to solve each task. While the model faces a disjoint set of classes in each task in this setting, we argue that these classes can be encoded to the same embedding space of a pre-trained language encoder. In this work, we propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods. LGCL is model agnostic and introduces language guidance at the task level in the prompt pool and at the class level on the output feature of the vision encoder. We show with extensive experimentation that LGCL consistently improves the performance of prompt-based continual learning methods to set a new state of the art. LGCL achieves these performance improvements without needing any additional learnable parameters.

* Accepted at ICCV 2023

Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images

Aug 21, 2023

Nikolas Ebert, Didier Stricker, Oliver Wasenmüller

Abstract: Many medical or pharmaceutical processes have strict guidelines regarding continuous hygiene monitoring. This often involves the labor-intensive task of manually counting microorganisms in Petri dishes by trained personnel. Automation attempts often struggle due to major challenges such as significant scaling differences, low separation, and low contrast. To address these challenges, we introduce AttnPAFPN, a high-resolution detection pipeline that leverages a novel transformer variation, the efficient-global self-attention mechanism. Our streamlined approach can be easily integrated into almost any multi-scale object detection pipeline. In a comprehensive evaluation on the publicly available AGAR dataset, we demonstrate the superior accuracy of our network over the current state-of-the-art. In order to demonstrate the task-independent performance of our approach, we perform further experiments on the COCO and LIVECell datasets.

* This paper has been accepted at IEEE International Conference on Computer Vision Workshops (ICCV workshop), 2023

DELO: Deep Evidential LiDAR Odometry using Partial Optimal Transport

Aug 14, 2023

Sk Aziz Ali, Djamila Aouada, Gerd Reis, Didier Stricker

Abstract: Accurate, robust, and real-time LiDAR-based odometry (LO) is imperative for many applications like robot navigation, globally consistent 3D scene map reconstruction, or safe motion-planning. Though LiDAR sensors are known for their precise range measurements, the non-uniform and uncertain point sampling density induces structural inconsistencies. Hence, existing supervised and unsupervised point set registration methods fail to establish one-to-one matching correspondences between LiDAR frames. We introduce a novel deep learning-based real-time (approx. 35-40 ms per frame) LO method that jointly learns accurate frame-to-frame correspondences and the model's predictive uncertainty (PU) as evidence to safeguard LO predictions. In this work, we (i) propose partial optimal transportation of LiDAR feature descriptors for robust LO estimation, (ii) jointly learn predictive uncertainty while learning odometry over driving sequences, and (iii) demonstrate how PU can serve as evidence for necessary pose-graph optimization when the LO network is either under- or overconfident. We evaluate our method on the KITTI dataset and show competitive performance and even superior generalization ability compared with recent state-of-the-art approaches. Source code is available.

* Accepted in ICCV 2023 Workshop

U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds

Aug 11, 2023

Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, Federico Tombari

Abstract: In this paper, we propose U-RED, an Unsupervised shape REtrieval and Deformation pipeline that takes an arbitrary object observation as input, typically captured by RGB images or scans, and jointly retrieves and deforms the geometrically similar CAD models from a pre-established database to tightly match the target. Considering existing methods typically fail to handle noisy partial observations, U-RED is designed to address this issue from two aspects. First, since one partial shape may correspond to multiple potential full shapes, the retrieval method must allow such an ambiguous one-to-many relationship. Thereby U-RED learns to project all possible full shapes of a partial target onto the surface of a unit sphere. Then during inference, each sampling on the sphere will yield a feasible retrieval. Second, since real-world partial observations usually contain noticeable noise, a reliable learned metric that measures the similarity between shapes is necessary for stable retrieval. In U-RED, we design a novel point-wise residual-guided metric that allows noise-robust comparison. Extensive experiments on the synthetic datasets PartNet, ComplementMe and the real-world dataset Scan2CAD demonstrate that U-RED surpasses existing state-of-the-art approaches by 47.3%, 16.7% and 31.6% respectively under Chamfer Distance.

* ICCV2023
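
The U-RED abstract above reports its gains under Chamfer Distance. As a minimal sketch of that metric, assuming the common symmetric, mean-of-squared-nearest-neighbour convention (the abstract does not say which variant the paper uses, and the function name `chamfer_distance` is hypothetical), it can be computed for two small point sets like this:

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention (not taken from the paper): for each point, take the
    squared Euclidean distance to its nearest neighbour in the other set,
    average per set, and add the two directions.
    """
    def mean_nearest_sq(src, dst):
        total = 0.0
        for p in src:
            # Squared distance from p to its closest point in dst.
            total += min(
                sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst
            )
        return total / len(src)

    return mean_nearest_sq(points_a, points_b) + mean_nearest_sq(points_b, points_a)


if __name__ == "__main__":
    full_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    partial_shape = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(full_shape, partial_shape))
```

This brute-force version is quadratic in the number of points, so it is only suitable for illustration; other conventions (unsquared distances, sums instead of means) give different absolute values.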

FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision

Aug 07, 2023

Khurram Azeem Hashmi, Goutham Kallempudi, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Extracting useful visual cues for the downstream tasks is especially challenging under low-light vision. Prior works create enhanced representations by either correlating visual quality with machine perception or designing illumination-degrading transformation methods that require pre-training on synthetic datasets. We argue that optimizing enhanced image representation pertaining to the loss of the downstream task can result in more expressive representations. Therefore, in this work, we propose a novel module, FeatEnHancer, that hierarchically combines multiscale features using multiheaded attention guided by a task-related loss function to create suitable representations. Furthermore, our intra-scale enhancement improves the quality of features extracted at each scale or level, as well as combines features from different scales in a way that reflects their relative importance for the task at hand. FeatEnHancer is a general-purpose plug-and-play module and can be incorporated into any low-light vision pipeline. We show with extensive experimentation that the enhanced representation produced with FeatEnHancer significantly and consistently improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), highlighting the effectiveness of enhancing hierarchical features under low-light vision.

* 19 pages, 9 Figures, and 10 Tables. Accepted at ICCV2023
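
The FeatEnHancer abstract above quotes a segmentation gain in mIoU. As a rough illustration of what that number measures, here is a minimal sketch of mean intersection-over-union on flat label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions for this example; benchmark toolkits such as the one for ACDC may handle ignore labels and averaging differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred and target are equal-length sequences of integer class labels,
    e.g. a flattened segmentation map.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map (assumption)
            ious.append(intersection / union)

    if not ious:
        raise ValueError("no class from range(num_classes) is present")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

A reported gain such as "+5.1 mIoU" is usually a difference in this score expressed in percentage points.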

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

Jul 18, 2023

Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver Wasenmüller

Abstract: While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT to a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-off between parameter count and runtime as well as between parameter count and accuracy. With only 5 million parameters, we achieve 79.5% top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments, demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.

* This paper has been accepted at IEEE Intelligent Transportation Systems Conference (ITSC), 2023
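
The abstract above states its headline result as top-1 accuracy on ImageNet-1K. For readers unfamiliar with the metric, the following is a minimal sketch of how top-1 accuracy is computed from per-class scores. The function name `top1_accuracy` and the toy scores are illustrative assumptions and have nothing to do with the paper's model or data.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: one list of per-class scores per sample.
    labels: the true class index for each sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equally long")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score; ties resolve to the lowest index.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    toy_labels = [1, 0, 0]
    print(top1_accuracy(toy_scores, toy_labels))
```

On this toy input two of three predictions match, so the score is two thirds; a figure such as 79.5% is the same ratio taken over a benchmark's validation images.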

Achieving RGB-D level Segmentation Performance from a Single ToF Camera

Jun 30, 2023

Pranav Sharma, Jigyasa Singh Katrolia, Jason Rambach, Bruno Mirbach, Didier Stricker, Juergen Seiler

Abstract: Depth is a very important modality in computer vision, typically used as complementary information to RGB, provided by RGB-D cameras. In this work, we show that it is possible to obtain the same level of accuracy as RGB-D cameras on a semantic segmentation task using infrared (IR) and depth images from a single Time-of-Flight (ToF) camera. In order to fuse the IR and depth modalities of the ToF camera, we introduce a method utilizing depth-specific convolutions in a multi-task learning framework. In our evaluation on an in-car segmentation dataset, we demonstrate the competitiveness of our method against the more costly RGB-D approaches.
