Camera traps are used worldwide to monitor wildlife. Despite the increasing availability of Deep Learning (DL) models, the effective usage of this technology to support wildlife monitoring is limited. This is mainly due to the complexity of DL technology and high computing requirements. This paper presents the implementation of the light-weight and state-of-the-art YOLOv5 architecture for automated labeling of camera trap images of mammals in the Bialowieza Forest (BF), Poland. The camera trapping data were organized and harmonized using TRAPPER software, an open source application for managing large-scale wildlife monitoring projects. The proposed image recognition pipeline achieved an average accuracy of 85% F1-score in the identification of the 12 most commonly occurring medium-size and large mammal species in BF using a limited set of training and testing data (a total 2659 images with animals). Based on the preliminary results, we concluded that the YOLOv5 object detection and classification model is a promising light-weight DL solution after the adoption of transfer learning technique. It can be efficiently plugged in via an API into existing web-based camera trapping data processing platforms such as e.g. TRAPPER system. Since TRAPPER is already used to manage and classify (manually) camera trapping datasets by many research groups in Europe, the implementation of AI-based automated species classification may significantly speed up the data processing workflow and thus better support data-driven wildlife monitoring and conservation. Moreover, YOLOv5 developers perform better performance on edge devices which may open a new chapter in animal population monitoring in real time directly from camera trap devices.
Ground Penetrating Radar (GPR) is one of the most important non-destructive evaluation (NDE) instruments to detect and locate underground objects (i.e. rebars, utility pipes). Many of the previous researches focus on GPR image-based feature detection only, and none can process sparse GPR measurements to successfully reconstruct a very fine and detailed 3D model of underground objects for better visualization. To address this problem, this paper presents a novel robotic system to collect GPR data, localize the underground utilities, and reconstruct the underground objects' dense point cloud model. This system is composed of three modules: 1) visual-inertial-based GPR data collection module which tags the GPR measurements with positioning information provided by an omnidirectional robot; 2) a deep neural network (DNN) migration module to interpret the raw GPR B-scan image into a cross-section of object model; 3) a DNN-based 3D reconstruction module, i.e., GPRNet, to generate underground utility model with the fine 3D point cloud. The experiments show that our method can generate a dense and complete point cloud model of pipe-shaped utilities based on a sparse input, i.e., GPR raw data, with various levels of incompleteness and noise. The experiment results on synthetic data as well as field test data verified the effectiveness of our method.
We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usages. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-solve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline -- ClassSR that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses -- Class-Loss and Average-Loss to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, thus the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on DIV8K datasets. This general framework can also be applied in other low-level vision tasks.
Predicting registration error can be useful for evaluation of registration procedures, which is important for the adoption of registration techniques in the clinic. In addition, quantitative error prediction can be helpful in improving the registration quality. The task of predicting registration error is demanding due to the lack of a ground truth in medical images. This paper proposes a new automatic method to predict the registration error in a quantitative manner, and is applied to chest CT scans. A random regression forest is utilized to predict the registration error locally. The forest is built with features related to the transformation model and features related to the dissimilarity after registration. The forest is trained and tested using manually annotated corresponding points between pairs of chest CT scans in two experiments: SPREAD (trained and tested on SPREAD) and inter-database (including three databases SPREAD, DIR-Lab-4DCT and DIR-Lab-COPDgene). The results show that the mean absolute errors of regression are 1.07 $\pm$ 1.86 and 1.76 $\pm$ 2.59 mm for the SPREAD and inter-database experiment, respectively. The overall accuracy of classification in three classes (correct, poor and wrong registration) is 90.7% and 75.4%, for SPREAD and inter-database respectively. The good performance of the proposed method enables important applications such as automatic quality control in large-scale image analysis.
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: image and spoken and textual narratives. The proposed methodology departs from a baseline system that spawns a embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchen and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps to the training procedure yielding to get better embedding representations. The triad speech, image and words allows for a better estimate of the point embedding and show an improving of the performance within tasks like image and speech retrieval, even when text third modality, text, is not present in the task.
Printed Mathematical expression recognition (PMER) aims to transcribe a printed mathematical expression image into a structural expression, such as LaTeX expression. It is a crucial task for many applications, including automatic question recommendation, automatic problem solving and analysis of the students, etc. Currently, the mainstream solutions rely on solving image captioning tasks, all addressing image summarization. As such, these methods can be suboptimal for solving MER problem. In this paper, we propose a new method named EDSL, shorted for encoder-decoder with symbol-level features, to identify the printed mathematical expressions from images. The symbol-level image encoder of EDSL consists of segmentation module and reconstruction module. By performing segmentation module, we identify all the symbols and their spatial information from images in an unsupervised manner. We then design a novel reconstruction module to recover the symbol dependencies after symbol segmentation. Especially, we employ a position correction attention mechanism to capture the spatial relationships between symbols. To alleviate the negative impact from long output, we apply the transformer model for transcribing the encoded image into the sequential and structural output. We conduct extensive experiments on two real datasets to verify the effectiveness and rationality of our proposed EDSL method. The experimental results have illustrated that EDSL has achieved 92.7\% and 89.0\% in evaluation metric Match, which are 3.47\% and 4.04\% higher than the state-of-the-art method. Our code and datasets are available at https://github.com/abcAnonymous/EDSL .
Unsupervised shadow removal aims to learn a non-linear function to map the original image from shadow domain to non-shadow domain in the absence of paired shadow and non-shadow data. In this paper, we develop a simple yet efficient target-consistency generative adversarial network (TC-GAN) for the shadow removal task in the unsupervised manner. Compared with the bidirectional mapping in cycle-consistency GAN based methods for shadow removal, TC-GAN tries to learn a one-sided mapping to cast shadow images into shadow-free ones. With the proposed target-consistency constraint, the correlations between shadow images and the output shadow-free image are strictly confined. Extensive comparison experiments results show that TC-GAN outperforms the state-of-the-art unsupervised shadow removal methods by 14.9% in terms of FID and 31.5% in terms of KID. It is rather remarkable that TC-GAN achieves comparable performance with supervised shadow removal methods.
The automatic grading of diabetic retinopathy (DR) facilitates medical diagnosis for both patients and physicians. Existing researches formulate DR grading as an image classification problem. As the stages/categories of DR correlate with each other, the relationship between different classes cannot be explicitly described via a one-hot label because it is empirically estimated by different physicians with different outcomes. This class correlation limits existing networks to achieve effective classification. In this paper, we propose a Graph REsidual rE-ranking Network (GREEN) to introduce a class dependency prior into the original image classification network. The class dependency prior is represented by a graph convolutional network with an adjacency matrix. This prior augments image classification pipeline by re-ranking classification results in a residual aggregation manner. Experiments on the standard benchmarks have shown that GREEN performs favorably against state-of-the-art approaches.
We propose an automated method based on deep learning to compute the cardiothoracic ratio and detect the presence of cardiomegaly from chest radiographs. We develop two separate models to demarcate the heart and chest regions in an X-ray image using bounding boxes and use their outputs to calculate the cardiothoracic ratio. We obtain a sensitivity of 0.96 at a specificity of 0.81 with a mean absolute error of 0.0209 on a held-out test dataset and a sensitivity of 0.84 at a specificity of 0.97 with a mean absolute error of 0.018 on an independent dataset from a different hospital. We also compare three different segmentation model architectures for the proposed method and observe that Attention U-Net yields better results than SE-Resnext U-Net and EfficientNet U-Net. By providing a numeric measurement of the cardiothoracic ratio, we hope to mitigate human subjectivity arising out of visual assessment in the detection of cardiomegaly.
Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance. Recent work shows how to select from single-label prediction APIs. However the computation complexity of the previous approach is exponential in the number of labels and hence is not suitable for settings like OCR. In this work, we propose FrugalMCT, a principled framework that adaptively selects the APIs to use for different data in an online fashion while respecting user's budget. The API selection problem is cast as an integer linear program, which we show has a special structure that we leverage to develop an efficient online API selector with strong performance guarantees. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Tencent and other providers for tasks including multi-label image classification, scene text recognition and named entity recognition. Across diverse tasks, FrugalMCT can achieve over 90% cost reduction while matching the accuracy of the best single API, or up to 8% better accuracy while matching the best API's cost.