Sina Honari

AttEntropy: Segmenting Unknown Objects in Complex Scenes using the Spatial Attention Entropy of Semantic Segmentation Transformers

Dec 29, 2022
Krzysztof Lis, Matthias Rottmann, Sina Honari, Pascal Fua, Mathieu Salzmann

Vision transformers have emerged as powerful tools for many computer vision tasks, and their features and class tokens have been shown to be usable for salient object segmentation. However, the properties of segmentation transformers remain largely unstudied. In this work, we conduct an in-depth study of the spatial attentions of different backbone layers of semantic segmentation transformers and uncover interesting properties: the spatial attentions of a patch intersecting an object tend to concentrate within that object, whereas the attentions over larger, more uniform image areas diffuse broadly. In effect, vision transformers trained to segment a fixed set of object classes generalize to objects well beyond this set. We exploit this to extract heatmaps that can be used to segment unknown objects within diverse backgrounds, such as obstacles in traffic scenes. Our method is training-free, and its computational overhead is negligible. We use off-the-shelf transformers trained for street-scene segmentation to process other scene types.
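
To make the core quantity concrete, here is a minimal sketch, not the authors' released code, of turning one layer's spatial attention into a per-patch entropy heatmap; the shapes and function name are illustrative, and the attention weights are assumed to have been extracted from the backbone (e.g., via forward hooks).

```python
# A sketch of the entropy-heatmap idea; variable names are illustrative.
import torch

def attention_entropy_heatmap(attn: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """attn: (heads, N, N) softmax attention over N = grid_h * grid_w patches.
    Returns a (grid_h, grid_w) map of each patch's mean attention entropy;
    low entropy means focused attention, a cue for (possibly unknown) objects."""
    eps = 1e-8
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per head and query patch
    return ent.mean(dim=0).reshape(grid_h, grid_w)  # average over heads

# Hypothetical usage with random weights for a 32x64 patch grid.
attn = torch.softmax(torch.randn(8, 32 * 64, 32 * 64), dim=-1)
heatmap = attention_entropy_heatmap(attn, 32, 64)
```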


Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests

Dec 03, 2022
Christopher Beckham, Martin Weiss, Florian Golemo, Sina Honari, Derek Nowrouzezahrai, Christopher Pal

Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. We explore a controlled setting whereby questions are posed about the properties of a scene if that scene were observed from another viewpoint. To do this, we have created a new version of the CLEVR dataset that we call CLEVR Mental Rotation Tests (CLEVR-MRT). Using CLEVR-MRT, we examine standard methods, show how they fall short, then explore novel neural architectures that involve inferring volumetric representations of a scene. These volumes can be manipulated via camera-conditioned transformations to answer the question. We examine different model variants through rigorous ablations and demonstrate the efficacy of volumetric representations.

* Accepted for publication to Pattern Recognition journal 
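
The camera-conditioned manipulation of a latent volume can be pictured as a simple rigid resampling; the following is a sketch under assumptions (a plain rotation applied with PyTorch's affine_grid/grid_sample), not the paper's architecture.

```python
# Sketch: rigidly rotate a (B, C, D, H, W) latent volume to a new viewpoint.
import torch
import torch.nn.functional as F

def rotate_volume(vol: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """vol: (B, C, D, H, W) feature volume; R: (B, 3, 3) rotation matrices."""
    B = vol.shape[0]
    theta = torch.cat([R, torch.zeros(B, 3, 1)], dim=2)  # (B, 3, 4), no translation
    grid = F.affine_grid(theta, list(vol.shape), align_corners=False)
    return F.grid_sample(vol, grid, align_corners=False)

# Hypothetical usage: rotate by 90 degrees about the vertical axis.
R = torch.tensor([[[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]]])
out = rotate_volume(torch.randn(1, 64, 16, 16, 16), R)
```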

Unsupervised 3D Keypoint Estimation with Multi-View Geometry

Nov 23, 2022
Sina Honari, Pascal Fua

Given enough annotated training data, 3D human pose estimation models can achieve high accuracy. However, annotations are not always available, especially for people performing unusual activities. In this paper, we propose an algorithm that learns to detect 3D keypoints on human bodies from multiple views without any supervision other than the constraints that multi-view geometry provides. To ensure that the estimated 3D keypoints are meaningful, they are re-projected onto each view and constrained to be consistent with the person's mask that the model itself has initially estimated. Our approach outperforms other state-of-the-art unsupervised 3D human pose estimation methods on the Human3.6M and MPI-INF-3DHP benchmark datasets.
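
As a rough illustration of the reprojection constraint, here is a sketch, under assumptions, of projecting 3D keypoints into each view and scoring them against the estimated person masks; the names and the exact loss form are illustrative, not the paper's.

```python
# Sketch: project 3D keypoints into every view, then check mask consistency.
import torch
import torch.nn.functional as F

def project_points(X: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """X: (K, 3) 3D keypoints; P: (V, 3, 4) projection matrices -> (V, K, 2)."""
    Xh = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)  # homogeneous coords
    x = torch.einsum('vij,kj->vki', P, Xh)                 # (V, K, 3)
    return x[..., :2] / x[..., 2:3].clamp(min=1e-6)        # perspective divide

def mask_consistency_loss(uv: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """uv: (V, K, 2) pixel projections; masks: (V, H, W) soft person masks.
    Penalizes keypoints whose projections land outside the estimated mask."""
    V, H, W = masks.shape
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,      # x to [-1, 1]
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
    vals = F.grid_sample(masks[:, None], grid[:, None], align_corners=True)
    return (1 - vals).mean()
```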


Perspective Aware Road Obstacle Detection

Oct 04, 2022
Krzysztof Lis, Sina Honari, Pascal Fua, Mathieu Salzmann

While road obstacle detection techniques have become increasingly effective, they typically ignore the fact that, in practice, the apparent size of the obstacles decreases as their distance to the vehicle increases. In this paper, we account for this by computing a scale map encoding the apparent size of a hypothetical object at every image location. We then leverage this perspective map to (i) generate training data by injecting synthetic objects onto the road in a more realistic fashion than existing methods; and (ii) incorporate perspective information in the decoding part of the detection network to guide the obstacle detector. Our results on standard benchmarks show that, together, these two strategies significantly boost the obstacle detection performance, allowing our approach to consistently outperform state-of-the-art methods in terms of instance-level obstacle detection.
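
The scale map admits a compact closed form under a flat-road, pinhole-camera assumption (our simplification here, not necessarily the paper's exact construction): the apparent height of an object of fixed real height grows linearly with the number of pixels below the horizon.

```python
# Sketch: per-pixel apparent size (in px) of a hypothetical object on the road.
import numpy as np

def scale_map(img_h: int, img_w: int, f: float, cam_h: float,
              horizon_row: float, obj_h: float = 0.5) -> np.ndarray:
    """f: focal length (px); cam_h: camera height above the road (m);
    horizon_row: image row of the horizon; obj_h: object height (m)."""
    dv = np.arange(img_h, dtype=np.float64) - horizon_row   # px below horizon
    depth = np.where(dv > 0, f * cam_h / np.maximum(dv, 1e-6), np.inf)
    apparent = f * obj_h / depth          # pinhole model: size scales as 1/depth
    return np.tile(apparent[:, None], (1, img_w))

smap = scale_map(512, 1024, f=1000.0, cam_h=1.5, horizon_row=220.0)
```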


On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation

Mar 29, 2022
Soumava Kumar Roy, Leonardo Citraro, Sina Honari, Pascal Fua

Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. However, such annotations are often unavailable, so much of the recent attention has shifted towards semi- and weakly-supervised learning. Generating an effective form of supervision from few annotations still poses major challenges in crowded scenes. Since it is easy to observe a scene from multiple cameras, we propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available. We therefore train a 2D pose estimator so that its predictions correspond to the re-projections of the triangulated 3D poses, and train an auxiliary network on those predictions to produce the final 3D poses. We complement the triangulation with a weighting mechanism that nullifies the impact of noisy predictions caused by self-occlusion or occlusion from other subjects. Our experimental results on Human3.6M and MPI-INF-3DHP substantiate the significance of our weighting strategy, where we obtain state-of-the-art results in the semi- and weakly-supervised settings. We also contribute a new multi-player sports dataset that features occlusion and show the effectiveness of our algorithm over baseline triangulation methods.
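
The differentiable triangulation can be sketched with the classic confidence-weighted DLT solved by SVD, through which gradients flow back to the 2D estimator; this is an illustrative reconstruction under assumptions, not the authors' code.

```python
# Sketch: confidence-weighted DLT triangulation; SVD keeps it differentiable.
import torch

def triangulate(uv: torch.Tensor, P: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """uv: (V, 2) one joint's 2D predictions in V views; P: (V, 3, 4) cameras;
    w: (V,) confidences that down-weight occluded or noisy views. Returns (3,)."""
    # Each view contributes two DLT rows: u * P_3 - P_1 and v * P_3 - P_2.
    A = torch.cat([w[:, None] * (uv[:, 0:1] * P[:, 2] - P[:, 0]),
                   w[:, None] * (uv[:, 1:2] * P[:, 2] - P[:, 1])], dim=0)
    _, _, Vh = torch.linalg.svd(A)     # null vector = smallest singular vector
    Xh = Vh[-1]
    return Xh[:3] / Xh[3]
```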


Overcoming the Domain Gap in Neural Action Representations

Dec 23, 2021
Semih Günel, Florian Aymanns, Sina Honari, Pavan Ramdya, Pascal Fua

Relating animal behaviors to brain activity is a fundamental goal in neuroscience, with practical applications in building robust brain-machine interfaces. However, the domain gap between individuals is a major issue that prevents the training of general models that work on unlabeled subjects. Since 3D pose data can now be reliably extracted from multi-view video sequences without manual intervention, we propose to use it to guide the encoding of neural action representations, together with a set of neural and behavioral augmentations that exploit the properties of microscopy imaging. To reduce the domain gap, during training we swap neural and behavioral data across animals that appear to be performing similar actions. To demonstrate this, we test our methods on three very different multimodal datasets: one that features flies and their neural activity, one that contains human neural Electrocorticography (ECoG) data, and lastly RGB video data of human activities captured from different viewpoints.
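
The cross-animal swap can be pictured as follows; this sketch assumes a precomputed behavioral-similarity matrix, which is our stand-in for however similar actions are identified, and all names are illustrative.

```python
# Sketch: exchange behavioral data across animals performing similar actions.
import torch

def swap_across_animals(neural, behavior, animal_id, action_sim, thresh=0.9):
    """neural: (N, Dn); behavior: (N, Db); animal_id: (N,) integer ids;
    action_sim: (N, N) behavioral similarity (assumed precomputed).
    Pairs each neural sample with the behavior of the most similar sample
    from a *different* animal, when similarity exceeds the threshold."""
    sim = action_sim.clone()
    sim[animal_id[:, None] == animal_id[None, :]] = -1.0  # forbid same-animal pairs
    best_sim, best_idx = sim.max(dim=1)
    keep = best_sim <= thresh
    swapped = torch.where(keep[:, None], behavior, behavior[best_idx])
    return neural, swapped
```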


Adversarial Parametric Pose Prior

Dec 08, 2021
Andrey Davydov, Anastasia Remizova, Victor Constantin, Sina Honari, Mathieu Salzmann, Pascal Fua

The Skinned Multi-Person Linear (SMPL) model can represent a human body by mapping pose and shape parameters to body meshes. This has been shown to facilitate inferring 3D human pose and shape from images via different learning models. However, not all pose and shape parameter values yield physically plausible, or even realistic, body meshes. In other words, SMPL is under-constrained and may thus lead to invalid results when used to reconstruct humans from images, either by directly optimizing its parameters or by learning a mapping from the image to these parameters. In this paper, we therefore learn a prior that restricts the SMPL parameters to values that produce realistic poses via adversarial training. We show that our learned prior covers the diversity of the real-data distribution, facilitates optimization for 3D reconstruction from 2D keypoints, and yields better pose estimates when used for regression from images. We found that a prior based on a spherical latent distribution yields the best results. Furthermore, in all these tasks, it outperforms the state-of-the-art VAE-based approach to constraining the SMPL parameters.
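
A minimal sketch of an adversarial pose prior, assuming a standard GAN setup over 72-D axis-angle SMPL pose vectors; the layer sizes and names are illustrative, and the latent is normalized onto a sphere only to mirror the spherical-distribution finding above.

```python
# Sketch: GAN prior over SMPL pose parameters (illustrative architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, pose_dim = 32, 72
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))
D = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_poses: torch.Tensor) -> None:
    """real_poses: (B, 72) pose parameters sampled from mocap data."""
    B = real_poses.shape[0]
    z = F.normalize(torch.randn(B, latent_dim), dim=1)  # spherical latent
    fake = G(z)
    # Discriminator step: real poses -> 1, generated poses -> 0.
    d_loss = bce(D(real_poses), torch.ones(B, 1)) + bce(D(fake.detach()), torch.zeros(B, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: produce poses the discriminator accepts as real.
    g_loss = bce(D(fake), torch.ones(B, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```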


Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations

Nov 29, 2021
Semih Günel, Florian Aymanns, Sina Honari, Pavan Ramdya, Pascal Fua

A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior. For example, the ability to extract behavioral intentions from neural data, or neural decoding, is critical for developing effective brain-machine interfaces. Although simple linear models have been applied to this challenge, they cannot identify important non-linear relationships. Thus, a self-supervised means of identifying non-linear relationships between neural dynamics and behavior, in order to compute neural representations, remains an important open problem. To address this challenge, we generated a new multimodal dataset consisting of the spontaneous behaviors generated by fruit flies, Drosophila melanogaster -- a popular model organism in neuroscience research. The dataset includes 3D markerless motion capture data from six camera views of the animal generating spontaneous actions, as well as synchronously acquired two-photon microscope images capturing the activity of descending neuron populations that are thought to drive actions. Standard contrastive learning and unsupervised domain adaptation techniques struggle to learn neural action representations (embeddings computed from the neural data describing action labels) due to large inter-animal differences in both neural and behavioral modalities. To overcome this deficiency, we developed simple yet effective augmentations that close the inter-animal domain gap, allowing us to extract behaviorally relevant, yet domain-agnostic, information from neural data. This multimodal dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.

* Accepted into NeurIPS 2021 Workshop: Self-Supervised Learning - Theory and Practice 
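
The cross-modal contrastive objective can be sketched with a standard InfoNCE loss over matched neural/behavioral windows; shapes and names here are illustrative, not the paper's exact formulation.

```python
# Sketch: InfoNCE aligning neural and behavioral embeddings of matched windows.
import torch
import torch.nn.functional as F

def info_nce(neural_emb: torch.Tensor, behav_emb: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """neural_emb, behav_emb: (B, D). The diagonal pairs are positives;
    every other pair in the batch serves as a negative."""
    n = F.normalize(neural_emb, dim=1)
    b = F.normalize(behav_emb, dim=1)
    logits = n @ b.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(n.shape[0])
    return F.cross_entropy(logits, targets)
```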

SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation

Apr 30, 2021
Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Mathieu Salzmann, Pascal Fua, Matthias Rottmann

State-of-the-art semantic or instance segmentation deep neural networks (DNNs) are usually trained on a closed set of semantic classes. As such, they are ill-equipped to handle previously-unseen objects. However, detecting and localizing such objects is crucial for safety-critical applications such as perception for automated driving, especially if they appear on the road ahead. While some methods have tackled the tasks of anomalous or out-of-distribution object segmentation, progress remains slow, in large part due to the lack of solid benchmarks; existing datasets either consist of synthetic data or suffer from label inconsistencies. In this paper, we bridge this gap by introducing the "SegmentMeIfYouCan" benchmark. Our benchmark addresses two tasks: anomalous object segmentation, which considers any previously-unseen object category; and road obstacle segmentation, which focuses on any object on the road, be it known or unknown. We provide two corresponding datasets together with a test suite performing an in-depth method analysis, considering both established pixel-wise performance metrics and recent component-wise ones, which are insensitive to object sizes. We empirically evaluate multiple state-of-the-art baseline methods, including several specifically designed for anomaly/obstacle segmentation, on our datasets as well as on public ones, using our benchmark suite. The anomaly and obstacle segmentation results show that our datasets add diversity and difficulty to both dataset landscapes.

* 10 pages, 13 figures, website http://www.segmentmeifyoucan.com/ 
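
For concreteness, here is a sketch, not the benchmark's own evaluation suite, of two common pixel-wise anomaly metrics computed on per-pixel scores.

```python
# Sketch: pixel-wise average precision and FPR@95TPR for anomaly scores.
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def pixel_metrics(scores: np.ndarray, labels: np.ndarray):
    """scores: per-pixel anomaly scores; labels: 1 = anomalous, 0 = known."""
    s, y = scores.ravel(), labels.ravel()
    ap = average_precision_score(y, s)
    fpr, tpr, _ = roc_curve(y, s)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]  # FPR at the first TPR >= 0.95
    return ap, fpr95
```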