Loop closure is an important component of Simultaneous Localization and Mapping (SLAM) systems. Large Field-of-View (FoV) cameras have received extensive attention in the SLAM field as they can exploit more surrounding features on the panoramic image. In large-FoV VIO, for incorporating the informative cues located on the negative plane of the panoramic lens, image features are represented by a three-dimensional vector with a unit length. While the panoramic FoV is seemingly advantageous for loop closure, the benefits cannot easily be materialized under large-attitude-angle differences, where loop-closure frames can hardly be matched by existing methods. In this work, to fully unleash the potential of ultra-wide FoV, we propose to leverage the attitude information of a VIO system to guide the feature point detection of the loop closure. As loop closure on wide-FoV panoramic data further comes with a large number of outliers, traditional outlier rejection methods are not directly applicable. To tackle this issue, we propose a loop closure framework with a new outlier rejection method based on the unit length representation, to improve the accuracy of LF-VIO. On the public PALVIO dataset, a comprehensive set of experiments is carried out and the proposed LF-VIO-Loop outperforms state-of-the-art visual-inertial-odometry methods. Our code will be open-sourced at https://github.com/flysoaryun/LF-VIO-Loop.
In this paper, we address panoramic semantic segmentation, which provides a full-view and dense-pixel understanding of surroundings in a holistic way. Panoramic segmentation is under-explored due to two critical challenges: (1) image distortions and object deformations on panoramas; (2) lack of annotations for training panoramic segmenters. To tackle these problems, we propose a Transformer for Panoramic Semantic Segmentation (Trans4PASS) architecture. First, to enhance distortion awareness, Trans4PASS, equipped with Deformable Patch Embedding (DPE) and Deformable MLP (DMLP) modules, is capable of handling object deformations and image distortions whenever (before or after adaptation) and wherever (shallow or deep levels) by design. We further introduce the upgraded Trans4PASS+ model, featuring DMLPv2 with parallel token mixing to improve the flexibility and generalizability in modeling discriminative cues. Second, we propose a Mutual Prototypical Adaptation (MPA) strategy for unsupervised domain adaptation. Third, aside from Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images to explore a Synthetic-to-Real (Syn2Real) adaptation scheme in 360{\deg} imagery. Extensive experiments are conducted, which cover indoor and outdoor scenarios, and each of them is investigated with Pin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art performances on four domain adaptive panoramic semantic segmentation benchmarks. Code is available at https://github.com/jamycheung/Trans4PASS.
Humans have an innate ability to sense their surroundings, as they can extract the spatial representation from the egocentric perception and form an allocentric semantic map via spatial transformation and memory updating. However, endowing mobile agents with such a spatial sensing ability is still a challenge, due to two difficulties: (1) the previous convolutional models are limited by the local receptive field, thus, struggling to capture holistic long-range dependencies during observation; (2) the excessive computational budgets required for success, often lead to a separation of the mapping pipeline into stages, resulting the entire mapping process inefficient. To address these issues, we propose an end-to-end one-stage Transformer-based framework for Mapping, termed Trans4Map. Our egocentric-to-allocentric mapping process includes three steps: (1) the efficient transformer extracts the contextual features from a batch of egocentric images; (2) the proposed Bidirectional Allocentric Memory (BAM) module projects egocentric features into the allocentric memory; (3) the map decoder parses the accumulated memory and predicts the top-down semantic segmentation map. In contrast, Trans4Map achieves state-of-the-art results, reducing 67.2% parameters, yet gaining a +3.25% mIoU and a +4.09% mBF1 improvements on the Matterport3D dataset. Code will be made publicly available at https://github.com/jamycheung/Trans4Map.
Failure to timely diagnose and effectively treat depression leads to over 280 million people suffering from this psychological disorder worldwide. The information cues of depression can be harvested from diverse heterogeneous resources, e.g., audio, visual, and textual data, raising demand for new effective multi-modal fusion approaches for its automatic estimation. In this work, we tackle the task of automatically identifying depression from multi-modal data and introduce a sub-attention mechanism for linking heterogeneous information while leveraging Convolutional Bidirectional LSTM as our backbone. To validate this idea, we conduct extensive experiments on the public DAIC-WOZ benchmark for depression assessment featuring different evaluation modes and taking gender-specific biases into account. The proposed model yields effective results with 0.89 precision and 0.70 F1-score in detecting major depression and 4.92 MAE in estimating the severity. Our attention-based fusion module consistently outperforms conventional late fusion approaches and achieves a competitive performance compared to the previously published depression estimation frameworks, while learning to diagnose the disorder end-to-end and relying on far less preprocessing steps.
In this work, we introduce panoramic panoptic segmentation, as the most holistic scene understanding, both in terms of Field of View (FoV) and image-level understanding for standard camera-based input. A complete surrounding understanding provides a maximum of information to a mobile agent, which is essential for any intelligent vehicle in order to make informed decisions in a safety-critical dynamic environment such as real-world traffic. In order to overcome the lack of annotated panoramic images, we propose a framework which allows model training on standard pinhole images and transfers the learned features to a different domain in a cost-minimizing way. Using our proposed method with dense contrastive learning, we manage to achieve significant improvements over a non-adapted approach. Depending on the efficient panoptic segmentation architecture, we can improve 3.5-6.5% measured in Panoptic Quality (PQ) over non-adapted models on our established Wild Panoramic Panoptic Segmentation (WildPPS) dataset. Furthermore, our efficient framework does not need access to the images of the target domain, making it a feasible domain generalization approach suitable for a limited hardware setting. As additional contributions, we publish WildPPS: The first panoramic panoptic image dataset to foster progress in surrounding perception and explore a novel training procedure combining supervised and contrastive training.
Panoramic Annular Lens (PAL), composed of few lenses, has great potential in panoramic surrounding sensing tasks for mobile and wearable devices because of its tiny size and large Field of View (FoV). However, the image quality of tiny-volume PAL confines to optical limit due to the lack of lenses for aberration correction. In this paper, we propose an Annular Computational Imaging (ACI) framework to break the optical limit of light-weight PAL design. To facilitate learning-based image restoration, we introduce a wave-based simulation pipeline for panoramic imaging and tackle the synthetic-to-real gap through multiple data distributions. The proposed pipeline can be easily adapted to any PAL with design parameters and is suitable for loose-tolerance designs. Furthermore, we design the Physics Informed Image Restoration Network (PI2RNet), considering the physical priors of panoramic imaging and physics-informed learning. At the dataset level, we create the DIVPano dataset and the extensive experiments on it illustrate that our proposed network sets the new state of the art in the panoramic image restoration under spatially-variant degradation. In addition, the evaluation of the proposed ACI on a simple PAL with only 3 spherical lenses reveals the delicate balance between high-quality panoramic imaging and compact design. To the best of our knowledge, we are the first to explore Computational Imaging (CI) in PAL. Code and datasets will be made publicly available at https://github.com/zju-jiangqi/ACI-PI2RNet.
Human Pose Estimation (HPE) based on RGB images has experienced a rapid development benefiting from deep learning. However, event-based HPE has not been fully studied, which remains great potential for applications in extreme scenes and efficiency-critical conditions. In this paper, we are the first to estimate 2D human pose directly from 3D event point cloud. We propose a novel representation of events, the rasterized event point cloud, aggregating events on the same position of a small time slice. It maintains the 3D features from multiple statistical cues and significantly reduces memory consumption and computation complexity, proved to be efficient in our work. We then leverage the rasterized event point cloud as input to three different backbones, PointNet, DGCNN, and Point Transformer, with two linear layer decoders to predict the location of human keypoints. We find that based on our method, PointNet achieves promising results with much faster speed, whereas Point Transfomer reaches much higher accuracy, even close to previous event-frame-based methods. A comprehensive set of results demonstrates that our proposed method is consistently effective for these 3D backbone models in event-driven human pose estimation. Our method based on PointNet with 2048 points input achieves 82.46mm in MPJPE3D on the DHP19 dataset, while only has a latency of 12.29ms on an NVIDIA Jetson Xavier NX edge computing platform, which is ideally suitable for real-time detection with event cameras. Code will be made publicly at https://github.com/MasterHow/EventPointPose.
With the rapid development of high-speed communication and artificial intelligence technologies, human perception of real-world scenes is no longer limited to the use of small Field of View (FoV) and low-dimensional scene detection devices. Panoramic imaging emerges as the next generation of innovative intelligent instruments for environmental perception and measurement. However, while satisfying the need for large-FoV photographic imaging, panoramic imaging instruments are expected to have high resolution, no blind area, miniaturization, and multi-dimensional intelligent perception, and can be combined with artificial intelligence methods towards the next generation of intelligent instruments, enabling deeper understanding and more holistic perception of 360-degree real-world surrounding environments. Fortunately, recent advances in freeform surfaces, thin-plate optics, and metasurfaces provide innovative approaches to address human perception of the environment, offering promising ideas beyond conventional optical imaging. In this review, we begin with introducing the basic principles of panoramic imaging systems, and then describe the architectures, features, and functions of various panoramic imaging systems. Afterwards, we discuss in detail the broad application prospects and great design potential of freeform surfaces, thin-plate optics, and metasurfaces in panoramic imaging. We then provide a detailed analysis on how these techniques can help enhance the performance of panoramic imaging systems. We further offer a detailed analysis of applications of panoramic imaging in scene understanding for autonomous driving and robotics, spanning panoramic semantic image segmentation, panoramic depth estimation, panoramic visual localization, and so on. Finally, we cast a perspective on future potential and research directions for panoramic imaging instruments.
Structured Visual Content (SVC) such as graphs, flow charts, or the like are used by authors to illustrate various concepts. While such depictions allow the average reader to better understand the contents, images containing SVCs are typically not machine-readable. This, in turn, not only hinders automated knowledge aggregation, but also the perception of displayed in-formation for visually impaired people. In this work, we propose a synthetic dataset, containing SVCs in the form of images as well as ground truths. We show the usage of this dataset by an application that automatically extracts a graph representation from an SVC image. This is done by training a model via common supervised learning methods. As there currently exist no large-scale public datasets for the detailed analysis of SVC, we propose the Synthetic SVC (SSVC) dataset comprising 12,000 images with respective bounding box annotations and detailed graph representations. Our dataset enables the development of strong models for the interpretation of SVCs while skipping the time-consuming dense data annotation. We evaluate our model on both synthetic and manually annotated data and show the transferability of synthetic to real via various metrics, given the presented application. Here, we evaluate that this proof of concept is possible to some extend and lay down a solid baseline for this task. We discuss the limitations of our approach for further improvements. Our utilized metrics can be used as a tool for future comparisons in this domain. To enable further research on this task, the dataset is publicly available at https://bit.ly/3jN1pJJ
Pixel-wise semantic segmentation of RGB images can be advanced by exploiting informative features from supplementary modalities. In this work, we propose CMX, a vision-transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize to different sensing modalities encompassing various supplements and uncertainties, we consider that comprehensive cross-modal interactions should be provided. CMX is built with two streams to extract features from RGB images and the complementary modality (X-modality). In each feature extraction stage, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate the feature of the current modality by combining the feature from the other modality, in spatial- and channel-wise dimensions. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to mix them for the final semantic prediction. FFM is constructed with a cross-attention mechanism, which enables exchange of long-range contexts, enhancing both modalities' features at a global level. Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performances on five RGB-Depth benchmarks, as well as RGB-Thermal and RGB-Polarization datasets. Besides, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state-of-the-art. Code is available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.