Juho Kannala

DGC-GNN: Descriptor-free Geometric-Color Graph Neural Network for 2D-3D Matching

Jun 21, 2023
Shuzhe Wang, Juho Kannala, Daniel Barath

Direct matching of 2D keypoints in an input image to a 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its lower memory requirements, inherent privacy preservation, and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. However, existing algorithms often compromise on performance, resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper, we introduce DGC-GNN, a novel algorithm that employs a global-to-local Graph Neural Network (GNN) to progressively exploit geometric and color cues to represent keypoints, thereby improving matching robustness. Our global-to-local procedure encodes both Euclidean and angular relations at a coarse level, forming the geometric embedding that guides the local point matching. We evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not only doubles the accuracy of the state-of-the-art descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor-free methods. The code and trained models will be made publicly available.
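
As a rough illustration of the geometric cues mentioned in the abstract, the Python sketch below encodes each 2D keypoint by the distances and angles to its nearest neighbours and projects them to an embedding. The neighbourhood size, feature layout, and linear projection are illustrative assumptions, not the DGC-GNN architecture.

```python
import torch

def geometric_embedding(pts, k=8):
    """Toy per-point geometric encoding: for each 2D keypoint, gather
    distances and angles to its k nearest neighbours and project them to
    an embedding. Shapes and the (untrained) projection are assumptions."""
    d = torch.cdist(pts, pts)                      # (N, N) pairwise Euclidean distances
    knn_d, idx = d.topk(k + 1, largest=False)      # self + k nearest neighbours
    knn_d, idx = knn_d[:, 1:], idx[:, 1:]          # drop the self-match
    rel = pts[idx] - pts[:, None, :]               # (N, k, 2) offsets to neighbours
    ang = torch.atan2(rel[..., 1], rel[..., 0])    # (N, k) angular relations
    feats = torch.cat([knn_d, torch.cos(ang), torch.sin(ang)], dim=-1)  # (N, 3k)
    proj = torch.nn.Linear(3 * k, 64)              # untrained here, for illustration only
    return proj(feats)                             # (N, 64) geometric embedding

pts = torch.rand(100, 2) * 640                     # toy keypoints in a 640x640 image
print(geometric_embedding(pts).shape)              # torch.Size([100, 64])
```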

Simplified Temporal Consistency Reinforcement Learning

Jun 15, 2023
Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, Joni Pajarinen

Reinforcement learning is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and required computation. To improve sample efficiency, recent work focuses on model-based RL, which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper, we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This holds when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1 times faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.
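
The central idea, a latent dynamics model trained for temporal consistency, can be sketched roughly as follows. The network shapes, the stop-gradient target, and the cosine objective are common choices in this family of methods and are assumptions here, not the paper's exact architecture or loss.

```python
import torch
import torch.nn.functional as F

obs_dim, act_dim, latent_dim = 24, 6, 50

encoder = torch.nn.Sequential(torch.nn.Linear(obs_dim, 256), torch.nn.ELU(),
                              torch.nn.Linear(256, latent_dim))
dynamics = torch.nn.Sequential(torch.nn.Linear(latent_dim + act_dim, 256), torch.nn.ELU(),
                               torch.nn.Linear(256, latent_dim))

def temporal_consistency_loss(obs, act, next_obs):
    """Predict the next latent state and match it to the encoding of the
    next observation (stop-gradient target, cosine distance)."""
    z = encoder(obs)
    z_pred = dynamics(torch.cat([z, act], dim=-1))
    with torch.no_grad():                      # target latent, no gradient through it
        z_next = encoder(next_obs)
    return 1.0 - F.cosine_similarity(z_pred, z_next, dim=-1).mean()

batch = 32
loss = temporal_consistency_loss(torch.randn(batch, obs_dim),
                                 torch.randn(batch, act_dim),
                                 torch.randn(batch, obs_dim))
loss.backward()
```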

HSCNet++: Hierarchical Scene Coordinate Classification and Regression for Visual Localization with Transformer

May 05, 2023
Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Yi Zhao, Giorgos Tolias, Juho Kannala

Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, so that the matching is performed implicitly by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network that predicts pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The proposed method, an extension of HSCNet, allows us to train compact models that scale robustly to large environments. It sets a new state of the art for single-image localization on the 7-Scenes, 12-Scenes, and Cambridge Landmarks datasets, as well as on the combined indoor scenes.
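
A minimal sketch of coarse-to-fine scene coordinate prediction is given below: classify each pixel feature into a coarse scene region, then regress a residual offset. The soft region assignment, the region-centre table, and all dimensions are hypothetical illustrations, not the HSCNet++ design.

```python
import torch

feat_dim, num_regions = 128, 64

region_cls = torch.nn.Linear(feat_dim, num_regions)       # coarse: which scene region
coord_reg = torch.nn.Linear(feat_dim + num_regions, 3)    # fine: residual offset

# Hypothetical table of region centres in scene coordinates (metres).
region_centres = torch.randn(num_regions, 3)

def predict_scene_coords(pixel_feats):
    """Coarse-to-fine prediction for a batch of per-pixel features:
    classify a coarse region, then regress an offset conditioned on the
    (soft) region assignment."""
    probs = region_cls(pixel_feats).softmax(dim=-1)        # (B, num_regions)
    centre = probs @ region_centres                        # soft-assigned region centre
    offset = coord_reg(torch.cat([pixel_feats, probs], dim=-1))
    return centre + offset                                 # (B, 3) scene coordinates

print(predict_scene_coords(torch.randn(1024, feat_dim)).shape)  # torch.Size([1024, 3])
```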

TBPos: Dataset for Large-Scale Precision Visual Localization

Feb 20, 2023
Masud Fahim, Ilona Söchting, Luca Ferranti, Juho Kannala, Jani Boutellier

Image-based localization is a classical computer vision challenge with several well-known datasets. Generally, a dataset consists of a visual 3D database that captures the modeled scenery, as well as query images whose 3D pose is to be discovered. Usually the query images have been acquired with a camera that differs from the imaging hardware used to collect the 3D database; consequently, it is hard to obtain accurate ground truth poses between the query images and the 3D database. As the accuracy of visual localization algorithms constantly improves, precise ground truth becomes increasingly important. This paper proposes TBPos, a novel large-scale visual dataset for image-based positioning, which provides query images with fully accurate ground truth poses: both the database images and the query images have been derived from the same laser scanner data. In the experimental part of the paper, the proposed dataset is evaluated by means of an image-based localization pipeline.

* Scandinavian Conference on Image Analysis 2023 

BS3D: Building-scale 3D Reconstruction from RGB-D Images

Jan 03, 2023
Janne Mustaniemi, Juho Kannala, Esa Rahtu, Li Liu, Janne Heikkilä

Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often cover small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstructions using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement, which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.

MixupE: Understanding and Improving Mixup from Directional Derivative Perspective

Dec 29, 2022
Vikas Verma, Sarthak Mittal, Wai Hoh Tang, Hieu Pham, Juho Kannala, Yoshua Bengio, Arno Solin, Kenji Kawaguchi

Mixup is a popular data augmentation technique for training deep neural networks in which additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve generalization performance in many learning paradigms and applications. In this work, we first analyze Mixup and show that it implicitly regularizes infinitely many directional derivatives of all orders. We then propose a new method to improve Mixup based on this insight. To demonstrate the effectiveness of the proposed method, we conduct experiments across various domains such as images, tabular data, speech, and graphs. Our results show that the proposed method improves Mixup across various datasets and architectures, for instance improving over Mixup by 0.8% in ImageNet top-1 accuracy.
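
For reference, the sketch below shows vanilla Mixup, the augmentation the paper analyzes (not the proposed MixupE variant): inputs and one-hot labels are interpolated with a Beta-distributed coefficient.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Standard Mixup: linearly interpolate a pair of inputs and their
    one-hot labels with a Beta(alpha, alpha) mixing coefficient."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)   # two toy "images"
y1, y2 = np.eye(10)[3], np.eye(10)[7]                           # one-hot labels
x_mix, y_mix = mixup(x1, y1, x2, y2)
```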

SuperFusion: Multilevel LiDAR-Camera Fusion for Long-Range HD Map Generation and Prediction

Nov 28, 2022
Hao Dong, Xianjing Zhang, Xuan Jiang, Jun Zhang, Jintao Xu, Rui Ai, Weihao Gu, Huimin Lu, Juho Kannala, Xieyuanli Chen

High-definition (HD) semantic map generation of the environment is an essential component of autonomous driving. Existing methods have achieved good performance in this task by fusing different sensor modalities, such as LiDAR and camera. However, current works are based on raw data or network feature-level fusion and only consider short-range HD map generation, limiting their deployment to realistic autonomous driving applications. In this paper, we focus on building HD maps both at short range, i.e., within 30 m, and predicting long-range HD maps up to 90 m, which is required by downstream path planning and control tasks to improve the smoothness and safety of autonomous driving. To this end, we propose a novel network named SuperFusion, which exploits the fusion of LiDAR and camera data at multiple levels. We benchmark SuperFusion on the nuScenes dataset and a self-recorded dataset and show that it outperforms the state-of-the-art baseline methods by large margins. Furthermore, we propose a new metric to evaluate long-range HD map prediction and apply the generated HD maps to a downstream path planning task. The results show that using the long-range HD maps predicted by our method enables better path planning for autonomous vehicles. The code will be available at https://github.com/haomo-ai/SuperFusion.
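
As a loose illustration of a single fusion level, the sketch below concatenates camera and LiDAR features that are assumed to be already aligned in a common bird's-eye-view grid and mixes them with a small convolutional block. The channel sizes, grid resolution, and alignment step are assumptions; SuperFusion itself fuses the modalities at several levels.

```python
import torch

class BEVFusion(torch.nn.Module):
    """Illustrative BEV-level fusion: concatenate pre-aligned camera and
    LiDAR bird's-eye-view feature maps and mix them with a conv block."""
    def __init__(self, cam_ch=64, lidar_ch=64, out_ch=128):
        super().__init__()
        self.fuse = torch.nn.Sequential(
            torch.nn.Conv2d(cam_ch + lidar_ch, out_ch, 3, padding=1),
            torch.nn.BatchNorm2d(out_ch),
            torch.nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

fused = BEVFusion()(torch.randn(1, 64, 200, 100), torch.randn(1, 64, 200, 100))
print(fused.shape)   # torch.Size([1, 128, 200, 100])
```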

Expansion of Visual Hints for Improved Generalization in Stereo Matching

Nov 01, 2022
Andrea Pilzer, Yuxin Hou, Niki Loppi, Arno Solin, Juho Kannala

We introduce visual hints expansion for guiding stereo matching to improve generalization. Our work is motivated by the robustness of Visual Inertial Odometry (VIO) in computer vision and robotics, where a sparse and unevenly distributed set of feature points characterizes a scene. To improve stereo matching, we propose to elevate 2D hints to 3D points. These sparse and unevenly distributed 3D visual hints are expanded using a 3D random geometric graph, which enhances the learning and inference process. We evaluate our proposal on multiple widely adopted benchmarks and show improved performance without access to additional sensors other than the image sequence. To highlight practical applicability and symbiosis with visual odometry, we demonstrate how our methods run on embedded hardware.
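
A toy version of the expansion idea, assuming a fixed connection radius and midpoint interpolation (neither of which is taken from the paper), could look like this:

```python
import numpy as np

def expand_hints(points, radius=0.5):
    """Illustrative hint expansion: connect sparse 3D hints that lie within
    `radius` of each other (a random geometric graph) and add the edge
    midpoints as new hints."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.where(np.triu(d < radius, k=1))     # graph edges (upper triangle only)
    midpoints = 0.5 * (points[i] + points[j])     # new hints along edges
    return np.vstack([points, midpoints])

sparse_hints = np.random.rand(50, 3) * 2.0        # toy sparse 3D points from VIO
dense_hints = expand_hints(sparse_hints)
print(sparse_hints.shape, dense_hints.shape)
```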

* 2023 IEEE Winter Conference on Applications of Computer Vision (WACV) 

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

Oct 25, 2022
Yi Zhao, Rinu Boney, Alexander Ilin, Juho Kannala, Joni Pajarinen

Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and need to be fine-tuned further online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available at https://github.com/zhaoyi11/adaptive_bc.
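
The weighted behavior-cloning objective can be sketched as follows; the toy weight-adaptation rule is an assumption, as the paper's rule based on performance and training stability is more elaborate.

```python
import torch

def actor_loss(policy_action, dataset_action, q_value, bc_weight):
    """Actor objective with a weighted behavior-cloning term, as commonly
    used in offline-to-online fine-tuning."""
    rl_term = -q_value.mean()                                   # maximise the critic's value
    bc_term = ((policy_action - dataset_action) ** 2).mean()    # stay close to the data
    return rl_term + bc_weight * bc_term

def update_bc_weight(bc_weight, episode_return, best_return, step=0.05):
    """Toy adaptation: relax the constraint when online returns keep up
    with the best return seen so far, tighten it otherwise."""
    if episode_return >= best_return:
        return max(bc_weight - step, 0.0)
    return min(bc_weight + step, 1.0)

loss = actor_loss(torch.randn(32, 6), torch.randn(32, 6), torch.randn(32, 1), bc_weight=0.5)
```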

Continuous Monte Carlo Graph Search

Oct 04, 2022
Amin Babadi, Yi Zhao, Juho Kannala, Alexander Ilin, Joni Pajarinen

In many complex sequential decision-making tasks, online planning is crucial for high performance. For efficient online planning, Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off exploration against exploitation. MCTS outperforms comparison methods in various discrete decision-making domains such as Go, Chess, and Shogi. Subsequently, extensions of MCTS to continuous domains have been proposed. However, the inherent high branching factor and the resulting explosion of search tree size limit existing methods. To solve this problem, this paper proposes Continuous Monte Carlo Graph Search (CMCGS), a novel extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which produce a layered graph instead of an MCTS search tree. Experimental evaluation with limited sample budgets shows that CMCGS outperforms comparison methods in several complex continuous DeepMind Control Suite benchmarks and a 2D navigation task.
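
A very rough sketch of one layer of such a search graph, using k-means clustering and independent Gaussian action bandits (both assumptions for illustration, not necessarily the paper's choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def plan_step(states, n_nodes=3, act_dim=2):
    """One layer of a CMCGS-style graph: cluster the states reached at this
    depth into a few nodes and attach a Gaussian action bandit to each."""
    labels = KMeans(n_clusters=n_nodes, n_init=10, random_state=0).fit_predict(states)
    bandits = [{"mean": np.zeros(act_dim), "std": np.ones(act_dim)} for _ in range(n_nodes)]
    # Sample one action per state from its node's bandit.
    actions = np.stack([rng.normal(bandits[l]["mean"], bandits[l]["std"]) for l in labels])
    return labels, bandits, actions

states = rng.normal(size=(64, 4))             # toy states reached at this planning depth
labels, bandits, actions = plan_step(states)
print(actions.shape)                          # (64, 2)
```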

* Under review as a conference paper at ICLR 2023 