This paper proposes a novel anomaly detection (AD) approach for high-speed train images based on convolutional neural networks and the Vision Transformer. Unlike previous AD works, in which anomalies are identified from a single image using classification, segmentation, or object detection methods, the proposed method detects abnormal differences between two images of the same region taken at different times. In other words, we cast the single-image anomaly detection problem as a two-image difference detection problem. The core idea is that an 'anomaly' usually represents an abnormal state rather than a specific object, and this state should be identified from a pair of images. In addition, we introduce a deep feature difference AD network (AnoDFDNet) that exploits the potential of both the Vision Transformer and convolutional neural networks. To verify the effectiveness of the proposed AnoDFDNet, we collected three datasets: a difference dataset (Diff Dataset), a foreign body dataset (FB Dataset), and an oil leakage dataset (OL Dataset). Experimental results on these datasets demonstrate the superiority of the proposed method. Source code is available at https://github.com/wangle53/AnoDFDNet.
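The idea of casting anomaly detection as difference detection between two images can be illustrated with a minimal sketch. This is not AnoDFDNet: it compares raw pixel intensities block by block, whereas the paper compares deep features. The function names, the block size, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def block_differences(ref: Image, query: Image, block: int = 2) -> List[List[float]]:
    """Mean absolute difference between two same-sized images, per block.

    Raw pixels stand in for the deep features used in the paper (assumption).
    """
    if len(ref) != len(query) or any(len(a) != len(b) for a, b in zip(ref, query)):
        raise ValueError("images must have the same shape")
    height, width = len(ref), len(ref[0])
    scores = []
    for top in range(0, height, block):
        row_scores = []
        for left in range(0, width, block):
            diffs = [
                abs(ref[y][x] - query[y][x])
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            row_scores.append(sum(diffs) / len(diffs))
        scores.append(row_scores)
    return scores


def anomalous_blocks(scores: List[List[float]], threshold: float) -> List[Tuple[int, int]]:
    """Return (block_row, block_col) indices whose difference exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, score in enumerate(row)
        if score > threshold
    ]


# Reference image of a region, and a later image of the same region
# in which the bottom-right 2x2 block has changed.
reference = [[0.0] * 4 for _ in range(4)]
query = [row[:] for row in reference]
for y in (2, 3):
    for x in (2, 3):
        query[y][x] = 1.0

scores = block_differences(reference, query, block=2)
flagged = anomalous_blocks(scores, threshold=0.5)  # hypothetical threshold
print(flagged)
```

The sketch only flags where the pair of images differs; deciding whether a difference is abnormal rather than benign (for example, a lighting change) is what a learned model would have to do.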
India is the second largest producer of fruits and vegetables in the world, and one of the largest consumers of fruits such as bananas, papayas, and mangoes through retail and e-commerce giants like BigBasket, Grofers, and Amazon Fresh. However, adoption of technology in the supply chain and in retail stores is still low, and there is great potential to adopt computer-vision-based technology for the identification and classification of fruits. We have chosen the banana to build a computer-vision-based model for the following three use cases: (a) identify a banana in a given image, (b) determine the sub-family or variety of the banana, and (c) determine the quality of the banana. Successful execution of these use cases with a computer-vision model would greatly help with inventory management automation, quality control, and quick and efficient weighing and billing, all of which are currently manual and labor-intensive. In this work, we suggest a machine learning pipeline that combines CNNs, transfer learning, and data augmentation to improve banana sub-family and quality image classification. We built a basic CNN and then fine-tuned a MobileNet banana classification model using a dataset of 3064 images that combines self-curated and publicly available data. The results show an overall accuracy of 93.4% for sub-family/variety classification and 100% for quality classification.
Transformers and their variants have recently shown state-of-the-art results in many vision tasks, ranging from image classification to dense prediction. Despite their success, limited work has been reported on improving model efficiency for deployment in latency-critical applications, such as autonomous driving and robotic navigation. In this paper, we aim to improve upon existing vision transformers and propose a method for self-supervised monocular Depth Estimation with Simplified Transformer (DEST), which is efficient and particularly suitable for deployment on GPU-based platforms. Through strategic design choices, our model achieves a significant reduction in model size, complexity, and inference latency, while achieving superior accuracy compared to the state of the art. We also show that our design generalizes well to other dense prediction tasks without bells and whistles.
Rate-distortion optimization (RDO) of codecs, where distortion is quantified by the mean-square error, has been standard practice in image/video compression over the years. RDO serves well for optimizing codec performance when results are evaluated in terms of PSNR. However, it is well known that PSNR does not correlate well with perceptual evaluation of images; hence, RDO is not well suited for perceptual optimization of codecs. Recently, the rate-distortion-perception trade-off has been formalized by taking the Kullback-Leibler (KL) divergence between the distributions of the original and reconstructed images as a perception measure. Learned image compression methods that simultaneously optimize rate, mean-square loss, VGG loss, and an adversarial loss have been proposed. Yet, there exists no easy approach to fix the rate, distortion, or perception at a desired level in a practical learned image compression solution in order to analyze the trade-off between rate, distortion, and perception measures. In this paper, we propose a practical approach to fix the rate, so that perception-distortion analysis can be carried out at a fixed rate and image compression results can be evaluated perceptually in a principled manner. Experimental results provide several insights for practical rate-distortion-perception analysis in learned image compression.
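The quantities named above can be made concrete with a short sketch. It computes the mean-square error, the PSNR derived from it, a rate-distortion cost in the conventional Lagrangian form D + λR, and a KL divergence between two discrete distributions. The peak value of 255 (8-bit images), the multiplier name `lam`, the toy pixel values, and the use of small discrete histograms in place of image distributions are all assumptions made for illustration; none of this reproduces the method proposed in the paper.

```python
import math
from typing import Sequence


def mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean-square error between two equally long pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original: Sequence[float], reconstructed: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def rd_cost(distortion: float, rate: float, lam: float) -> float:
    """Conventional Lagrangian rate-distortion cost J = D + lam * R."""
    return distortion + lam * rate


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy 8-bit "image" and its reconstruction (hypothetical values).
original = [0, 255, 128, 64]
reconstructed = [0, 250, 130, 64]

distortion = mse(original, reconstructed)
quality_db = psnr(original, reconstructed)
cost = rd_cost(distortion, rate=2.0, lam=0.5)  # rate and lam are made-up numbers
perception_gap = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(distortion, round(quality_db, 2), cost, round(perception_gap, 3))
```

Because PSNR is a monotone function of the MSE, minimizing MSE-based distortion maximizes PSNR, which is why RDO suits PSNR-based evaluation but says nothing about the distribution-level perception measure.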
In real-world applications, it is common for continuous streams of new data to be introduced into a system. The model needs to learn newly added capabilities (future tasks) while retaining old knowledge (past tasks). Incremental learning has recently become increasingly appealing for this problem. Task-incremental learning is a kind of incremental learning in which the identity of each newly included task (a set of classes) remains known during inference. A common goal of task-incremental methods is to design a network of minimal size that maintains decent performance. To manage the stability-plasticity dilemma, different methods utilize replay memory of past tasks, specialized hardware, regularization monitoring, and so on. However, these methods are still not memory efficient in terms of architecture growth or input data costs. In this study, we present a simple yet effective adjustment network (SAN) for task-incremental learning that achieves near state-of-the-art performance while using a minimal architecture size and no memory instances, in contrast to previous state-of-the-art approaches. We investigate this approach on both 3D point cloud object (ModelNet40) and 2D image (CIFAR10, CIFAR100, MiniImageNet, MNIST, PermutedMNIST, notMNIST, SVHN, and FashionMNIST) recognition tasks and establish a strong baseline result for a fair comparison with existing methods. In both the 2D and 3D domains, we also observe that SAN is largely unaffected by task order in a task-incremental setting.
Deep learning-based image inpainting algorithms have shown great performance thanks to powerful priors learned from numerous external natural images. However, they produce unpleasant results on test images whose distribution is far from that of the training images, because the models are biased toward the training images. In this paper, we propose a simple image inpainting algorithm with test-time adaptation named AdaFill. Given a single out-of-distribution test image, our goal is to complete the hole region more naturally than pre-trained inpainting models do. To achieve this goal, we treat the remaining valid regions of the test image as additional training cues, because natural images have strong internal similarities. Through this test-time adaptation, our network can explicitly exploit externally learned image priors from the pre-trained features as well as the internal prior of the test image. Experimental results show that AdaFill outperforms other models on various out-of-distribution test images. Furthermore, a variant named ZeroFill, which is not pre-trained, also sometimes outperforms the pre-trained models.
This paper examines a combined supervised-unsupervised framework involving dictionary-based blind learning and deep supervised learning for MR image reconstruction from under-sampled k-space data. A major focus of the work is to investigate the possible synergy of learned features in traditional shallow reconstruction using adaptive sparsity-based priors and deep prior-based reconstruction. Specifically, we propose a framework that uses an unrolled network to refine a blind dictionary learning-based reconstruction. We compare the proposed method with strictly supervised deep learning-based reconstruction approaches on several datasets of varying sizes and anatomies. We also compare the proposed method to alternative approaches for combining dictionary-based methods with supervised learning in MR image reconstruction. The improvements yielded by the proposed framework suggest that the blind dictionary-based approach preserves fine image details that the supervised approach can iteratively refine, suggesting that the features learned using the two methods are complementary.
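To make "reconstruction from under-sampled k-space data" concrete, the following sketch shows the naive zero-filled baseline on a 1-D signal: transform to k-space, keep only some samples, fill the rest with zeros, and invert. This is only the starting point that learned methods improve on, not the dictionary-learning or unrolled-network approach described above. The signal, the sampling mask, and the use of a hand-written 1-D DFT instead of a 2-D FFT are assumptions made to keep the example self-contained.

```python
import cmath
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive discrete Fourier transform (signal -> k-space)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform (k-space -> signal)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# Hypothetical 1-D "image": a box profile.
signal = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
kspace = dft(signal)

# Keep only the three lowest frequencies (k = 0, +1, -1); the mask is made up.
mask = [True, True, False, False, False, False, False, True]
undersampled = [value if keep else 0j for value, keep in zip(kspace, mask)]

# Zero-filled reconstruction: invert k-space with the missing samples set to zero.
zero_filled = [value.real for value in idft(undersampled)]
fully_sampled = [value.real for value in idft(kspace)]

max_error = max(abs(a - b) for a, b in zip(signal, zero_filled))
print([round(v, 3) for v in zero_filled], round(max_error, 3))
```

The zero-filled result is a blurred version of the box, since the discarded high frequencies carry the edges. Recovering that lost detail is the job of the priors, whether sparsity-based or learned.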
The task of Outside Knowledge Visual Question Answering (OKVQA) requires an automatic system to answer natural language questions about images using external knowledge. We observe that many visual questions, which contain deictic referential phrases referring to entities in the image, can be rewritten as "non-grounded" questions and can be answered by existing text-based question answering systems. This allows for the reuse of existing text-based Open Domain Question Answering (QA) systems for visual question answering. In this work, we propose a potentially data-efficient approach that reuses existing systems for (a) image analysis, (b) question rewriting, and (c) text-based question answering to answer such visual questions. Given an image and a question pertaining to that image (a visual question), we first extract the entities present in the image using pre-trained object and scene classifiers. Using these detected entities, the visual questions can be rewritten so as to be answerable by open domain QA systems. We explore two rewriting strategies: (1) an unsupervised method using BERT for masking and rewriting, and (2) a weakly supervised approach that combines adaptive rewriting and reinforcement learning techniques to use the implicit feedback from the QA system. We test our strategies on the publicly available OKVQA dataset and obtain performance competitive with state-of-the-art models while using only 10% of the training data.
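The rewriting step can be illustrated with a toy, rule-based sketch: a deictic phrase such as "this animal" is replaced by an entity that an image classifier is assumed to have detected, yielding a question that a text-only QA system could answer. This is a simplified stand-in, not the BERT-based or reinforcement-learning strategies described above. The function name, the regular expression, and the category-to-entity dictionary are hypothetical.

```python
import re
from typing import Dict

# Matches a demonstrative followed by one word, e.g. "this animal".
DEICTIC = re.compile(r"\b(this|that|these|those)\s+(\w+)\b", re.IGNORECASE)


def rewrite_question(question: str, detected: Dict[str, str]) -> str:
    """Replace 'this <category>' with 'the <entity>' when the category was detected.

    `detected` maps a coarse category to the entity label that a pre-trained
    classifier is assumed to have produced for the image.
    """

    def substitute(match: "re.Match[str]") -> str:
        category = match.group(2).lower()
        entity = detected.get(category)
        # Leave the phrase untouched when no matching entity was detected.
        return f"the {entity}" if entity is not None else match.group(0)

    return DEICTIC.sub(substitute, question)


# Hypothetical classifier output for one image.
detections = {"animal": "giraffe", "vehicle": "tram"}

grounded = "What does this animal eat?"
non_grounded = rewrite_question(grounded, detections)
print(non_grounded)

# A phrase with no detected counterpart is left as it is.
unchanged = rewrite_question("Who invented this device?", detections)
print(unchanged)
```

A fixed rule like this fails as soon as the referring phrase does not name a detected category, which is one motivation for learned rewriting strategies that adapt to feedback from the QA system.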
The recent advances in machine learning and the availability of free and open big Earth data (e.g., Sentinel missions), which cover large areas with high spatial and temporal resolution, have enabled many agriculture monitoring applications. One example is the control of subsidy allocations of the Common Agricultural Policy (CAP). Advanced remote sensing systems have been developed towards the large-scale evidence-based monitoring of the CAP. Nevertheless, the spatial resolution of satellite images is not always adequate to make accurate decisions for all fields. In this work, we introduce the notion of space-to-ground data availability, i.e., from the satellite to the field, in an attempt to make the best out of the complementary characteristics of the different sources. We present a space-to-ground dataset that contains Sentinel-1 radar and Sentinel-2 optical image time-series, as well as street-level images from the crowdsourcing platform Mapillary, for grassland fields in the area of Utrecht for 2017. The multifaceted utility of our dataset is showcased through the downstream task of grassland classification. We train machine and deep learning algorithms on these different data domains and highlight the potential of fusion techniques towards increasing the reliability of decisions.
Research studies have shown no qualms about using data-driven deep learning models for downstream tasks in medical image analysis, e.g., anatomy segmentation and lesion detection, disease diagnosis and prognosis, and treatment planning. However, deep learning models are not the sovereign remedy for medical image analysis when the upstream imaging is not conducted properly (i.e., contains artefacts). This has been manifested in MRI studies, where scanning is typically slow and prone to motion artefacts, with a relatively low signal-to-noise ratio and poor spatial and/or temporal resolution. Recent studies have witnessed substantial growth in the development of deep learning techniques for propelling fast MRI. This article aims to (1) introduce deep learning-based data-driven techniques for fast MRI, including convolutional neural network and generative adversarial network based methods, (2) survey attention and transformer based models for speeding up MRI reconstruction, and (3) detail the research on coupling physics and data-driven models for MRI acceleration. Finally, we demonstrate these techniques through a few clinical applications, explain the importance of data harmonisation and explainable models for such fast MRI techniques in multicentre and multi-scanner studies, and discuss common pitfalls in current research and recommendations for future research directions.