Esa Rahtu


MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis

Nov 05, 2023
Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, Esa Rahtu

Metaverse technologies demand accurate, real-time, and immersive modeling on consumer-grade hardware, both for non-human perception (e.g., drone/robot/autonomous-car navigation) and for immersive technologies such as AR/VR, which require structural accuracy as well as photorealism. However, there is a knowledge gap in how to combine geometric reconstruction and photorealistic modeling (novel view synthesis) in a unified framework. To address this gap and to promote robust, immersive modeling and rendering with consumer-grade devices, we first propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset poses challenging demands: methods must be cost-effective, robust to noisy data and devices, and able to learn 3D reconstruction and novel view synthesis jointly rather than treating them as separate tasks, which makes them well suited to real-world applications. Second, we benchmark several well-established pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Finally, to further improve the overall performance, we propose a new method that achieves a good trade-off between the two tasks. Our dataset and benchmark show great potential for advancing the fusion of 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion.
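
The joint evaluation this benchmark calls for can be pictured with two standard metrics, one per task: PSNR for rendered novel views and Chamfer distance for reconstructed geometry. The sketch below is a minimal, illustrative pairing of the two; the function names and toy data are placeholders, not part of the MuSHRoom toolkit.

```python
# Hypothetical sketch of a joint evaluation loop: the same reconstructed scene is
# scored both for novel view synthesis (PSNR) and for geometry (Chamfer distance).
import numpy as np

def psnr(rendered: np.ndarray, reference: np.ndarray) -> float:
    """PSNR between two images with values in [0, 1]."""
    mse = np.mean((rendered - reference) ** 2)
    return float(10.0 * np.log10(1.0 / mse))

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point sets (brute force)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy data standing in for rendered/reference views and sampled mesh surfaces.
rng = np.random.default_rng(0)
img_pred, img_gt = rng.random((64, 64, 3)), rng.random((64, 64, 3))
pts_pred, pts_gt = rng.random((500, 3)), rng.random((500, 3))

print(f"NVS  PSNR: {psnr(img_pred, img_gt):.2f} dB")
print(f"Mesh Chamfer: {chamfer_distance(pts_pred, pts_gt):.4f}")
```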

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Apr 04, 2023
Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Shin'ichi Satoh

Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are neither reliable nor repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that current automatic measures are inconsistent with human perception when evaluating text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.

* CVPR 2023 
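
One simple way to quantify the mismatch between automatic measures and human judgments described above is to rank-correlate per-model scores. The sketch below does this with Kendall's tau; the models, FID values, and preference rates are invented for illustration and do not come from the paper's released protocol.

```python
# Illustrative sketch: how well does an automatic metric agree with human judgments?
from scipy.stats import kendalltau

models = ["model_a", "model_b", "model_c", "model_d"]
fid_scores = [23.1, 18.4, 30.2, 21.7]   # lower is better (made-up numbers)
human_pref = [0.62, 0.55, 0.71, 0.48]   # e.g. mean preference rate, higher is better

# Negate FID so that both lists are "higher is better" before correlating.
tau, p_value = kendalltau([-s for s in fid_scores], human_pref)
print(f"Kendall tau between FID ranking and human ranking: {tau:.2f} (p={p_value:.3f})")
```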

FinnWoodlands Dataset

Apr 03, 2023
Juan Lagos, Urho Lempiö, Esa Rahtu

While the availability of large and diverse datasets has contributed to significant breakthroughs in autonomous driving and indoor applications, forestry applications still lag behind, and new forest datasets would contribute substantially to the development of data-driven methods for forest-like scenarios. This paper introduces a forest dataset called FinnWoodlands, which consists of RGB stereo images, point clouds, and sparse depth maps, as well as ground-truth manual annotations for semantic, instance, and panoptic segmentation. FinnWoodlands comprises a total of 4226 manually annotated objects, of which 2562 (60.6%) are tree trunks classified into three instance categories: "Spruce Tree", "Birch Tree", and "Pine Tree". Besides tree trunks, we also annotated "Obstacles" as instances, as well as the semantic stuff classes "Lake", "Ground", and "Track". Our dataset can be used in forestry applications where a holistic representation of the environment is relevant. We provide an initial benchmark using three models for instance segmentation, panoptic segmentation, and depth completion, and illustrate the challenges that such unstructured scenarios introduce.

* Scandinavian Conference on Image Analysis 2023 
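
Panoptic segmentation benchmarks of this kind are commonly scored with Panoptic Quality (PQ). The sketch below shows the standard PQ formula on made-up inputs; the IoU matching step is assumed to have been done already, and none of this is FinnWoodlands benchmark code.

```python
# A minimal sketch of the standard Panoptic Quality (PQ) metric.
# Predicted and ground-truth segments are assumed to have been matched already
# (a match is a pair with IoU > 0.5); the numbers below are illustrative.

def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ = (sum of IoUs over true positives) / (TP + 0.5*FP + 0.5*FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Example: 3 matched segments, 1 spurious prediction, 2 missed ground-truth segments.
print(f"PQ = {panoptic_quality([0.82, 0.91, 0.67], 1, 2):.3f}")
```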

MSDA: Monocular Self-supervised Domain Adaptation for 6D Object Pose Estimation

Feb 14, 2023
Dingding Cai, Janne Heikkilä, Esa Rahtu

Acquiring labeled 6D poses from real images is an expensive and time-consuming task. Although massive amounts of synthetic RGB images are easy to obtain, models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D) data without needing real pose labels. We first pre-train the model with synthetic RGB images and then utilize real RGB(-D) images to fine-tune the pre-trained model. The fine-tuning process is self-supervised by an RGB-based pose-aware consistency and a depth-guided object distance pseudo-label, and does not require time-consuming online differentiable rendering. We build our domain adaptation method on the recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We experimentally demonstrate that our method achieves performance comparable to its fully supervised counterpart while outperforming existing state-of-the-art approaches.

* SCIA 2023 
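
The depth-guided object distance pseudo-label can be illustrated roughly as follows: the median depth inside an estimated object mask stands in for the object-to-camera distance and supervises the predicted distance with an L1 term. This is a hedged sketch of the general idea with placeholder names, not the authors' implementation.

```python
# Hedged sketch of a depth-guided distance pseudo-label: no ground-truth pose needed.
import numpy as np

def distance_pseudo_label(depth_map: np.ndarray, object_mask: np.ndarray) -> float:
    """Median depth over valid pixels inside the object mask (meters)."""
    values = depth_map[(object_mask > 0) & (depth_map > 0)]  # ignore holes in the depth map
    return float(np.median(values))

def distance_loss(predicted_distance: float, depth_map: np.ndarray, object_mask: np.ndarray) -> float:
    """L1 self-supervision term between the predicted distance and its pseudo-label."""
    return abs(predicted_distance - distance_pseudo_label(depth_map, object_mask))
```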

BS3D: Building-scale 3D Reconstruction from RGB-D Images

Jan 03, 2023
Janne Mustaniemi, Juho Kannala, Esa Rahtu, Li Liu, Janne Heikkilä

Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstructions using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement, which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.
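
A generic building block behind depth-based odometry of the kind used here is back-projecting a raw depth map into a point cloud with the pinhole intrinsics. The sketch below is a textbook version of that step, not code from the BS3D framework.

```python
# Back-project a depth map into camera-frame 3D points using pinhole intrinsics.
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert a (H, W) depth map in meters into an (N, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with a valid depth reading
```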

PanDepth: Joint Panoptic Segmentation and Depth Completion

Dec 29, 2022
Juan Lagos, Esa Rahtu

Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple computer vision tasks are involved. Multi-task models provide different types of outputs for a given scene, yielding a more holistic representation while keeping the computational cost low. We propose a multi-task model for panoptic segmentation and depth completion using RGB images and sparse depth maps. Our model predicts fully dense depth maps and performs semantic, instance, and panoptic segmentation for every input frame. Extensive experiments on the Virtual KITTI 2 dataset demonstrate that our model solves multiple tasks without a significant increase in computational cost while maintaining high accuracy. Code is available at https://github.com/juanb09111/PanDepth.git
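
Multi-task models of this kind are typically trained with a weighted sum of per-task losses. The sketch below shows that pattern with placeholder task names and weights; it is not the exact objective used by PanDepth.

```python
# Illustrative weighted multi-task objective combining segmentation and depth terms.
def multi_task_loss(losses: dict, weights: dict) -> float:
    """Weighted sum of per-task losses, e.g. semantic, instance and depth terms."""
    return sum(weights[name] * value for name, value in losses.items())

example = multi_task_loss(
    losses={"semantic": 0.84, "instance": 1.21, "depth": 0.35},
    weights={"semantic": 1.0, "instance": 1.0, "depth": 10.0},
)
print(f"total loss = {example:.2f}")
```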

Supervised Fine-tuning Evaluation for Long-term Visual Place Recognition

Nov 14, 2022
Farid Alijani, Esa Rahtu

In this paper, we present a comprehensive study on the utility of deep convolutional neural networks with two state-of-the-art pooling layers, placed after the convolutional layers and fine-tuned end-to-end, for visual place recognition under challenging conditions, including seasonal and illumination variations. We extensively compare the performance of the learned global features with three different loss functions, namely triplet, contrastive, and ArcFace, used to learn the parameters of the architectures, measured by the fraction of correct matches at deployment. To verify the effectiveness of our results, we use two real-world place recognition datasets, one indoor and one outdoor. Our investigation demonstrates that fine-tuning the architectures with the ArcFace loss in an end-to-end manner outperforms the other two losses by approximately 1-4% on the outdoor dataset and 1-2% on the indoor dataset, given certain thresholds, for visual place recognition tasks.

* MMSP (2021) 1-6  
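
The pipeline described above (a CNN backbone, a trainable pooling layer, and a metric-learning loss on global descriptors) can be sketched with GeM pooling and a triplet loss in PyTorch. The layer sizes, margin, and other hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged PyTorch sketch: trainable GeM pooling on CNN feature maps + triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling with a learnable exponent p."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        pooled = x.clamp(min=self.eps).pow(self.p).mean(dim=(-2, -1)).pow(1.0 / self.p)
        return F.normalize(pooled, dim=-1)                # L2-normalized global descriptor

pool = GeM()
feats = torch.randn(3, 512, 7, 7).abs()                   # anchor / positive / negative feature maps
anchor, positive, negative = pool(feats).split(1)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.1)
print(loss.item())
```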

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

Oct 13, 2022
Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync

* Accepted as a spotlight presentation at BMVC 2022. Code: https://github.com/v-iashin/SparseSync Project page: https://v-iashin.github.io/SparseSync 
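
The 'selector' idea can be illustrated as a small set of learnable query tokens that cross-attend to a long feature stream and distil it into a short sequence. The sketch below is a generic version with made-up sizes, not the released SparseSync model.

```python
# Generic illustration: learnable query tokens distil a long stream via cross-attention.
import torch
import torch.nn as nn

class Selector(nn.Module):
    def __init__(self, dim: int = 256, num_selectors: int = 8, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_selectors, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, stream: torch.Tensor) -> torch.Tensor:  # stream: (B, T, dim), T can be long
        queries = self.queries.expand(stream.size(0), -1, -1)
        distilled, _ = self.attn(queries, stream, stream)      # (B, num_selectors, dim)
        return distilled

selector = Selector()
long_audio_stream = torch.randn(2, 2000, 256)   # e.g. thousands of audio feature frames
print(selector(long_audio_stream).shape)        # torch.Size([2, 8, 256])
```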

The Weighting Game: Evaluating Quality of Explainability Methods

Aug 12, 2022
Lassi Raatikainen, Esa Rahtu

The objective of this paper is to assess the quality of explanation heatmaps for image classification tasks. We approach this task through the lens of accuracy and stability and make the following contributions. Firstly, we introduce the Weighting Game, which measures how much of a class-guided explanation is contained within the correct class' segmentation mask. Secondly, we introduce a metric for explanation stability, using zooming/panning transformations to measure differences between saliency maps with similar contents. Using these new metrics, we conduct quantitative experiments to evaluate the quality of explanations provided by commonly used CAM methods. The quality of explanations is also compared across different model architectures, with findings highlighting the need to consider model architecture when choosing an explainability method.
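
The Weighting Game score can be pictured as the fraction of the (non-negative) saliency mass that falls inside the correct class' segmentation mask. The sketch below computes that quantity on random data; the exact normalization used in the paper may differ.

```python
# Share of explanation weight lying within the target class mask (illustrative only).
import numpy as np

def weighting_game_score(saliency: np.ndarray, class_mask: np.ndarray) -> float:
    """Fraction of non-negative saliency mass inside the target segmentation mask."""
    saliency = np.clip(saliency, 0, None)
    total = saliency.sum()
    return float(saliency[class_mask > 0].sum() / total) if total > 0 else 0.0

rng = np.random.default_rng(0)
heatmap = rng.random((224, 224))
mask = np.zeros((224, 224))
mask[64:160, 64:160] = 1
print(f"score = {weighting_game_score(heatmap, mask):.3f}")
```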

HRF-Net: Holistic Radiance Fields from Sparse Inputs

Aug 09, 2022
Phong Nguyen-Ha, Lam Huynh, Esa Rahtu, Jiri Matas, Janne Heikkilä

We present HRF-Net, a novel view synthesis method based on holistic radiance fields that renders novel views from a set of sparse inputs. Recent generalizing view synthesis methods also leverage radiance fields, but their rendering speed is not real-time, while existing methods that can train and render novel views efficiently cannot generalize to unseen scenes. Our approach addresses real-time rendering for generalizing view synthesis and consists of two main stages: a holistic radiance field predictor and a convolution-based neural renderer. This architecture not only infers consistent scene geometry from the implicit neural fields but also renders new views efficiently on a single GPU. We first train HRF-Net on multiple 3D scenes of the DTU dataset, and the network can produce plausible novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high-speed rendering of the pre-trained model. Experimental results show that HRF-Net outperforms state-of-the-art generalizable neural rendering methods on various synthetic and real datasets.

* In submission 
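
Radiance-field renderers like the one described above ultimately rely on the standard volume-rendering step: compositing per-sample colors along a ray using densities and inter-sample distances. The sketch below is that generic NeRF-style step, not HRF-Net's renderer.

```python
# Standard volume-rendering composite along a single ray (generic NeRF-style sketch).
import numpy as np

def composite_ray(colors: np.ndarray, densities: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """colors: (S, 3), densities/deltas: (S,) -> rendered RGB for one ray."""
    alphas = 1.0 - np.exp(-densities * deltas)                      # opacity of each sample
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas                                # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

rng = np.random.default_rng(0)
print(composite_ray(rng.random((64, 3)), rng.random(64) * 5.0, np.full(64, 0.02)))
```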