Simultaneous localization and mapping (SLAM) stands as one of the critical challenges in robot navigation. Recent advancements suggest that methods based on supervised learning deliver impressive performance in front-end odometry, while traditional optimization-based methods still play a vital role in the back-end for minimizing estimation drift. In this paper, we find that such a decoupled paradigm can lead to only sub-optimal performance, consequently curtailing system capabilities and generalization potential. To solve this problem, we propose a novel self-supervised learning framework, imperative SLAM (iSLAM), which fosters reciprocal correction between the front-end and back-end, thus enhancing performance without necessitating any external supervision. Specifically, we formulate a SLAM system as a bi-level optimization problem so that the two components are bidirectionally connected. As a result, the front-end model is able to learn global geometric knowledge obtained through pose graph optimization by back-propagating the residuals from the back-end. This significantly improves the generalization ability of the entire system and thus achieves accuracy improvements of up to 45%. To the best of our knowledge, iSLAM is the first SLAM system showing that the front-end and back-end can learn jointly and mutually contribute to each other in a self-supervised manner.
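For intuition, the sketch below shows, under toy assumptions, how a back-end residual can be back-propagated into a learned front-end: a small network predicts relative poses, the poses are composed into a trajectory, and a loop-closure residual serves as the self-supervised loss. The SE(2) pose representation, the network, and all names are illustrative placeholders, not iSLAM's implementation.

```python
# Illustrative sketch (not the authors' code): back-propagating a pose-graph
# residual from the back-end into a learned front-end, in the spirit of the
# bi-level formulation described above. Poses are simplified to SE(2) vectors
# (x, y, theta) and the "front-end" is a toy MLP; all names are hypothetical.
import torch
import torch.nn as nn

class ToyOdometryFrontEnd(nn.Module):
    """Predicts a relative pose (dx, dy, dtheta) from a feature vector."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 3))

    def forward(self, feats):            # feats: (T, feat_dim)
        return self.net(feats)           # (T, 3) relative poses

def compose(abs_pose, rel):
    """Compose an absolute SE(2) pose with a relative motion."""
    x, y, th = abs_pose
    dx, dy, dth = rel
    cx, sx = torch.cos(th), torch.sin(th)
    return torch.stack([x + cx * dx - sx * dy, y + sx * dx + cx * dy, th + dth])

frontend = ToyOdometryFrontEnd()
opt = torch.optim.Adam(frontend.parameters(), lr=1e-3)
feats = torch.randn(10, 16)              # stand-in for image/IMU features

for _ in range(100):
    rels = frontend(feats)
    pose, poses = torch.zeros(3), []
    for r in rels:                        # integrate odometry into a trajectory
        pose = compose(pose, r)
        poses.append(pose)
    # Back-end: a loop-closure constraint says the last pose should return to
    # the origin; its residual acts as the self-supervised training signal.
    residual = poses[-1] - torch.zeros(3)
    loss = (residual ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```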
Detecting unseen instances based on multi-view templates is a challenging problem due to its open-world nature. Traditional methodologies, which primarily rely on 2D representations and matching techniques, are often inadequate in handling pose variations and occlusions. To solve this, we introduce VoxDet, a pioneering 3D geometry-aware framework that fully utilizes a strong 3D voxel representation and a reliable voxel matching mechanism. VoxDet first proposes a template voxel aggregation (TVA) module, which effectively transforms multi-view 2D images into 3D voxel features. By leveraging associated camera poses, these features are aggregated into a compact 3D template voxel. In novel instance detection, this voxel representation demonstrates heightened resilience to occlusion and pose variations. We also discover that a 3D reconstruction objective helps to pre-train the 2D-3D mapping in TVA. Second, to quickly align with the template voxel, VoxDet incorporates a query voxel matching (QVM) module. The 2D queries are first converted into their voxel representation with the learned 2D-3D mapping. We find that since the 3D voxel representations encode the geometry, we can first estimate the relative rotation and then compare the aligned voxels, leading to improved accuracy and efficiency. Exhaustive experiments are conducted on the demanding LineMod-Occlusion, YCB-Video, and the newly built RoboTools benchmarks, where VoxDet remarkably outperforms various 2D baselines with 20% higher recall and faster speed. To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D detection tasks.
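The following minimal sketch illustrates the aggregate-then-rotate-and-compare idea in simplified form: per-view voxel features (assumed already pose-aligned) are averaged into a template, and a query voxel is matched by searching discrete 90-degree rotations and scoring cosine similarity. The discrete rotation search and all tensor shapes are assumptions for illustration, not VoxDet's QVM module.

```python
# Hypothetical sketch of the rotate-then-compare idea behind voxel matching,
# simplified to axis-aligned 90-degree rotations; not the VoxDet implementation.
import torch

def aggregate_template(per_view_voxels):
    """Average per-view voxel features (assumed already pose-aligned) into one template."""
    return torch.stack(per_view_voxels).mean(dim=0)      # (C, D, H, W)

def match_query(query_vox, template_vox):
    """Search discrete in-plane rotations, align the query, and score the best match."""
    best_score, best_k = -1.0, 0
    for k in range(4):                                    # 0/90/180/270 degrees
        rotated = torch.rot90(query_vox, k, dims=(2, 3))
        score = torch.cosine_similarity(rotated.flatten(), template_vox.flatten(), dim=0)
        if score > best_score:
            best_score, best_k = score.item(), k
    return best_score, best_k

views = [torch.randn(8, 16, 16, 16) for _ in range(4)]   # toy per-view voxel features
template = aggregate_template(views)
score, rot = match_query(torch.rot90(template, 1, dims=(2, 3)), template)
print(f"best rotation index {rot}, similarity {score:.3f}")
```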
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). On the other hand, the previous best regret upper bound is still $O(K^{1/3} T^{2/3}\log^{1/3}(T))$, which is achieved by the streaming implementation of simple uniform exploration. The $O(K^{1/3}\log^{1/3}(T))$ gap leaves open the question of the tight regret bound for single-pass MABs with sublinear arm memory. In this paper, we answer this open question and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform-exploration regret up to a logarithmic factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary: we can achieve $O(K^{1/3}T^{2/3})$ regret by finding an $\varepsilon$-best arm and committing to it for the rest of the trials. For regret minimization with high constant probability, we can apply the single-arm-memory $\varepsilon$-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with a single-arm memory that achieves $O(K^{1/3} T^{2/3}\log(K))$ regret, and an algorithm with $O(\log^{*}(K))$ memory and the optimal $O(K^{1/3} T^{2/3})$ regret, following the $\varepsilon$-best arm algorithm of Assadi and Wang [STOC'20]. We further test the empirical performance of our algorithms. The simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin and, on occasion, reduce the regret by up to 70%.
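As a concrete point of reference, the snippet below sketches the explore-then-commit form of uniform exploration with Bernoulli rewards: each arm is pulled about $(T/K)^{2/3}$ times in a single pass while only the running best arm is retained, and the remaining trials commit to that arm. It is a toy baseline for intuition, not the sublinear-memory algorithms analyzed in the paper, and its parameters are illustrative.

```python
# A minimal explore-then-commit sketch of the uniform-exploration baseline that
# the bounds above compare against; the streaming machinery of the actual
# algorithms is omitted, and all parameters here are illustrative.
import random

def explore_then_commit(means, T):
    K = len(means)
    m = max(1, int((T / K) ** (2.0 / 3.0)))       # per-arm exploration budget ~ (T/K)^{2/3}
    pulls, total = 0, 0.0
    best_arm, best_avg = 0, float("-inf")
    for arm, mu in enumerate(means):              # single pass over the arm stream
        s = sum(1.0 if random.random() < mu else 0.0 for _ in range(m))
        pulls += m
        total += s
        if s / m > best_avg:                      # keep only the running best arm
            best_avg, best_arm = s / m, arm
    for _ in range(T - pulls):                    # commit to the empirical best arm
        total += 1.0 if random.random() < means[best_arm] else 0.0
    return T * max(means) - total                 # realized regret against the best mean

print(explore_then_commit([0.5, 0.6, 0.4, 0.55], T=100_000))
```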
Achieving an immersive experience that enables users to explore virtual environments with six degrees of freedom (6DoF) is essential for various applications such as virtual reality (VR). Wide-baseline panoramas are commonly used in these applications to reduce network bandwidth and storage requirements. However, synthesizing novel views from these panoramas remains a key challenge. Although existing neural radiance field methods can produce photorealistic views under narrow-baseline and dense image captures, they tend to overfit the training views when dealing with \emph{wide-baseline} panoramas, due to the difficulty of learning accurate geometry from sparse $360^{\circ}$ views. To address this problem, we propose PanoGRF, Generalizable Spherical Radiance Fields for Wide-baseline Panoramas, which constructs spherical radiance fields incorporating $360^{\circ}$ scene priors. Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection. Moreover, as some regions of a panorama are visible from only one view under wide-baseline settings, PanoGRF incorporates $360^{\circ}$ monocular depth priors into spherical depth estimation to improve the geometry features. Experimental results on multiple panoramic datasets demonstrate that PanoGRF significantly outperforms state-of-the-art generalizable view synthesis methods for wide-baseline panoramas (e.g., OmniSyn) and perspective images (e.g., IBRNet, NeuRay).
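For illustration, the snippet below sketches the spherical (equirectangular) projection that maps a 3D sample point in a panorama camera's frame to pixel coordinates, which is the kind of lookup used when aggregating per-view features; the axis conventions and function names are assumptions rather than PanoGRF's exact code.

```python
# Illustrative sketch of an equirectangular (spherical) projection used to look up
# per-view features for a 3D sample point; conventions here are assumptions, not
# PanoGRF's exact implementation.
import numpy as np

def spherical_project(point_cam, width, height):
    """Project a 3D point in a panorama camera's frame to equirectangular pixel coords."""
    x, y, z = point_cam
    r = np.linalg.norm(point_cam)
    lon = np.arctan2(x, z)                  # azimuth in [-pi, pi]
    lat = np.arcsin(y / r)                  # elevation in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width   # horizontal pixel coordinate
    v = (lat / np.pi + 0.5) * height        # vertical pixel coordinate
    return u, v, r                          # r doubles as the point's spherical depth

u, v, depth = spherical_project(np.array([1.0, 0.2, 2.0]), width=1024, height=512)
print(f"u={u:.1f}, v={v:.1f}, depth={depth:.2f}")
```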
There are increasing concerns about malicious attacks on autonomous vehicles. In particular, inaudible voice command attacks pose a significant threat as voice commands become available in autonomous driving systems. How to empirically defend against these inaudible attacks remains an open question. Previous research investigates deep learning-based multimodal fusion for defense, without considering the model uncertainty that underpins trustworthiness. As deep learning is applied to increasingly sensitive tasks, uncertainty measurement is crucial for improving model robustness, especially in mission-critical scenarios. In this paper, we propose the Multimodal Fusion Framework (MFF) as an intelligent security system to defend against inaudible voice command attacks. MFF fuses heterogeneous audio-vision modalities using VGG-family neural networks and achieves a detection accuracy of 92.25% in our comparative empirical study of fusion methods. Additionally, extensive experiments on audio-vision tasks reveal the model's uncertainty: we measure calibration errors using the Expected Calibration Error and estimate the predictive distribution of the proposed models with Monte-Carlo Dropout. Our findings show empirically how to train robust multimodal models, improve standard accuracy, and provide a further step toward interpretability. Finally, we discuss the pros and cons of our approach and its applicability to Advanced Driver Assistance Systems.
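The two uncertainty tools named above are standard; the sketch below shows generic versions of Monte-Carlo Dropout predictive averaging and the Expected Calibration Error, with the model, bin count, and sample count as placeholder assumptions rather than MFF's configuration.

```python
# Generic sketches of MC-Dropout predictive averaging and Expected Calibration Error;
# the model passed in and the bin/sample counts are placeholders.
import torch
import numpy as np

def mc_dropout_predict(model, x, n_samples=20):
    """Keep dropout active at test time and average softmax outputs over samples."""
    model.train()                                    # keeps dropout layers enabled
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)       # predictive mean and spread

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over confidence bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        mask = (confidences >= lo) & (confidences < lo + 1.0 / n_bins)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```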
Recent analysis of incidents involving Autonomous Driving Systems (ADS) has shown that the decision-making process of ADS can be significantly different from that of human drivers. To improve the performance of ADS, it may be helpful to incorporate the human decision-making process, particularly the signals provided by human gaze. Many existing works create human gaze datasets and predict human gaze using deep learning models. However, current human gaze datasets are noisy and include irrelevant objects that can hinder model training. Additionally, existing CNN-based models for predicting human gaze lack generalizability across different datasets and driving conditions, and many exhibit a centre bias such that the predicted gaze tends to concentrate at the centre of the gaze map. To address these gaps, we propose an adaptive method for cleansing existing human gaze datasets and a robust convolutional self-attention gaze prediction model. Our quantitative metrics show that our cleansing method improves models' performance by up to 7.38% and generalizability by up to 8.24% compared to models trained on the original datasets. Furthermore, our model demonstrates an improvement of up to 12.13% in generalizability compared to the state-of-the-art (SOTA) models. Notably, it achieves these gains while conserving up to 98.12% of computation resources.
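For illustration, a minimal convolutional self-attention block of the general kind described is sketched below; the channel count, head count, and layer arrangement are assumptions for exposition and do not reproduce the proposed model.

```python
# A hypothetical minimal convolutional self-attention block, purely illustrative of
# the architecture family named above rather than the paper's exact model.
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                          # x: (B, C, H, W)
        x = torch.relu(self.conv(x))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C): one token per pixel
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)      # residual connection + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

block = ConvSelfAttentionBlock()
print(block(torch.randn(1, 32, 16, 16)).shape)     # torch.Size([1, 32, 16, 16])
```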
Existing models in cognitive science typically treat human categorization as graded generalization behavior in a multidimensional psychological space. However, category representations in these models may suffer from the curse of dimensionality in natural settings. People generally rely on a tractable yet sufficient set of features to understand the complex environment. We propose a rational model of categorization based on a hierarchical mixture of probabilistic principal components that simultaneously learns category representations and an economical collection of features. The model captures dimensional biases in human categorization and supports zero-shot learning. We further exploit a generative process within a low-dimensional latent space to provide a better account of categorization with high-dimensional stimuli. We validate the model with simulations and behavioral experiments.
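A toy sketch of the underlying modeling idea is shown below: each category is summarized by a low-dimensional probabilistic PCA and a new stimulus is assigned to the category with the highest likelihood. The per-category fitting, the two latent dimensions, and the synthetic data are simplifications for illustration; they omit the hierarchical mixture and the behavioral fitting procedure.

```python
# A toy sketch of categorization via per-category probabilistic PCA: each category is
# modeled by a low-dimensional PPCA and new stimuli go to the most likely category.
# Illustrates the modeling idea only, not the paper's hierarchical mixture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cat_a = rng.normal([0, 0, 0, 0], 1.0, size=(200, 4))     # toy high-dimensional stimuli
cat_b = rng.normal([3, 3, 0, 0], 1.0, size=(200, 4))

# Fit one probabilistic PCA (2 latent dimensions) per category.
models = {name: PCA(n_components=2).fit(data)
          for name, data in [("A", cat_a), ("B", cat_b)]}

def categorize(x):
    """Return the category whose PPCA assigns the stimulus the highest log-likelihood."""
    scores = {name: m.score_samples(x[None])[0] for name, m in models.items()}
    return max(scores, key=scores.get)

print(categorize(np.array([2.8, 3.1, 0.2, -0.1])))        # expected: "B"
```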
With the world population rapidly increasing, transforming our agrifood systems to be more productive, efficient, safe, and sustainable is crucial to mitigate potential food shortages. Recently, artificial intelligence (AI) techniques such as deep learning (DL) have demonstrated their strong abilities in various areas, including language, vision, remote sensing (RS), and agrifood systems applications. However, the overall impact of AI on agrifood systems remains unclear. In this paper, we thoroughly review how AI techniques can transform agrifood systems and contribute to the modern agrifood industry. Firstly, we summarize data acquisition methods in agrifood systems, together with storage and processing techniques. Secondly, we present a progress review of AI methods in agrifood systems, specifically in agriculture, animal husbandry, and fishery, covering topics such as agrifood classification, growth monitoring, yield prediction, and quality assessment. Furthermore, we highlight potential challenges and promising research opportunities for transforming modern agrifood systems with AI. We hope this survey offers an overall picture to newcomers in the field and serves as a starting point for their further research.
Focused plenoptic cameras can record the spatial and angular information of the light field (LF) simultaneously, with higher spatial resolution than traditional plenoptic cameras, which facilitates various applications in computer vision. However, existing plenoptic image compression methods are ineffective for the captured images due to the complex micro-textures generated by microlens relay imaging and the long-distance correlations among the microimages. In this paper, a lossy end-to-end learning architecture is proposed to compress focused plenoptic images efficiently. First, a data preprocessing scheme is designed according to the imaging principle to remove the ineffective sub-aperture image pixels in the recorded light field and align the microimages to a rectangular grid. Then, a global attention module with a large receptive field is proposed to capture the global correlation among the feature maps, using pixel-wise vector attention computed in the resampling process. Also, a new image dataset consisting of 1910 focused plenoptic images with content and depth diversity is built to support training and testing. Extensive experimental evaluations demonstrate the effectiveness of the proposed approach. It outperforms the intra coding of HEVC and VVC by an average bitrate reduction of 62.57% and 51.67%, respectively, on the 20 preprocessed focused plenoptic images. It also achieves an 18.73% bitrate saving and generates perceptually pleasant reconstructions compared to state-of-the-art end-to-end image compression methods, greatly benefiting applications of focused plenoptic cameras. The dataset and code are publicly available at https://github.com/VincentChandelier/GACN.
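For a rough picture of the preprocessing step, the snippet below crops the valid centre of each microimage on a regular microlens grid and repacks the crops into a rectangular array; the grid pitch, crop size, and function name are illustrative assumptions rather than the paper's exact scheme.

```python
# A schematic sketch of microimage repacking: crop the valid centre of each microimage
# on a regular grid and tile the crops. Pitch and crop size are illustrative assumptions.
import numpy as np

def repack_microimages(plenoptic_img, pitch=32, crop=24):
    """Crop the central `crop` x `crop` region of each `pitch`-spaced microimage and tile them."""
    h, w = plenoptic_img.shape[:2]
    rows, cols = h // pitch, w // pitch
    margin = (pitch - crop) // 2
    out = np.zeros((rows * crop, cols * crop) + plenoptic_img.shape[2:], plenoptic_img.dtype)
    for r in range(rows):
        for c in range(cols):
            block = plenoptic_img[r * pitch + margin: r * pitch + margin + crop,
                                  c * pitch + margin: c * pitch + margin + crop]
            out[r * crop:(r + 1) * crop, c * crop:(c + 1) * crop] = block
    return out

img = np.random.rand(512, 512, 3)
print(repack_microimages(img).shape)    # (384, 384, 3) with the default pitch/crop
```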
Indoor relocalization is vital both for robotic tasks like autonomous exploration and for civil applications such as navigating with a cell phone in a shopping mall. Some previous approaches adopt geometric information such as key-point features or local textures to carry out indoor relocalization, but they either easily fail in environments with visually similar scenes or require many database images. Inspired by the fact that humans often remember places by recognizing unique landmarks, we resort to objects, which are more informative than geometric elements. In this work, we propose a simple yet effective object-based indoor relocalization approach, dubbed AirLoc. To overcome the critical challenges of object reidentification and remembering object relationships, we extract object-wise appearance embeddings and inter-object geometric relationships. The geometry and appearance features are integrated to generate cumulative scene features. This results in a robust, accurate, and portable indoor relocalization system that outperforms state-of-the-art methods in room-level relocalization by 9.5% in PR-AUC and 7% in accuracy. In addition to this exhaustive evaluation, we also carry out real-world tests, where AirLoc shows robustness to challenges like severe occlusion, perceptual aliasing, viewpoint shift, and deformation.
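The sketch below illustrates, under simplified assumptions, how object appearance embeddings and inter-object geometry can be fused into a single scene descriptor and matched against a database by similarity; the feature choices and fusion are placeholders, not AirLoc's architecture.

```python
# An illustrative sketch of fusing object appearance with inter-object geometry into a
# scene descriptor and matching by similarity; simplified assumptions, not AirLoc's design.
import numpy as np

def scene_descriptor(obj_embeddings, obj_positions, n_bins=8, max_dist=10.0):
    """Fuse averaged object appearance with a histogram of pairwise object distances."""
    appearance = np.mean(obj_embeddings, axis=0)
    dists = [np.linalg.norm(p - q) for i, p in enumerate(obj_positions)
             for q in obj_positions[i + 1:]]
    geometry, _ = np.histogram(dists, bins=n_bins, range=(0.0, max_dist), density=True)
    return np.concatenate([appearance, geometry])

def relocalize(query_desc, room_descs):
    """Return the database room whose descriptor is most similar to the query."""
    sims = {room: float(np.dot(query_desc, d) /
                        (np.linalg.norm(query_desc) * np.linalg.norm(d)))
            for room, d in room_descs.items()}
    return max(sims, key=sims.get)

rooms = {"lab": scene_descriptor(np.random.rand(5, 16), np.random.rand(5, 3)),
         "lobby": scene_descriptor(np.random.rand(4, 16), np.random.rand(4, 3))}
query = scene_descriptor(np.random.rand(5, 16), np.random.rand(5, 3))
print(relocalize(query, rooms))
```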