Adequate blood supply is critical for normal brain function. Brain vasculature dysfunctions such as stalled blood flow in cerebral capillaries are associated with cognitive decline and pathogenesis in Alzheimer's disease. Recent advances in imaging technology enabled generation of high-quality 3D images that can be used to visualize stalled blood vessels. However, localization of stalled vessels in 3D images is often required as the first step for downstream analysis, which can be tedious, time-consuming and error-prone, when done manually. Here, we describe a deep learning-based approach for automatic detection of stalled capillaries in brain images based on 3D convolutional neural networks. Our networks employed custom 3D data augmentations and were used weight transfer from pre-trained 2D models for initialization. We used an ensemble of several 3D models to produce the winning submission to the Clog Loss: Advance Alzheimer's Research with Stall Catchers machine learning competition that challenged the participants with classifying blood vessels in 3D image stacks as stalled or flowing. In this setting, our approach outperformed other methods and demonstrated state-of-the-art results, achieving 0.85 Matthews correlation coefficient, 85% sensitivity, and 99.3% specificity. The source code for our solution is made publicly available.
In this paper, we consider the problem of learning landmarks for object categories without any manual annotations. We cast this as the problem of conditionally generating an image of an object from another one, where the images differ by acquisition time and/or viewpoint. The process is aided by providing the generator with a keypoint-like representation extracted from the target image through a tight bottleneck. This encourages the representation to distil information about the object geometry, which changes from source to target, while the appearance, which is shared between the source and target, is read off from the source alone. Conditioning simplifies the generation task significantly, to the point that adopting a simple perceptual loss instead of more sophisticated approaches such as adversarial training is sufficient to learn landmarks. We show that our method is applicable to a large variety of datasets - faces, people, 3D objects, and digits - without any modifications. We further demonstrate that we can learn landmarks from synthetic image deformations or videos, all without manual supervision, while outperforming state-of-the-art unsupervised landmark detectors.
Our contribution is a unified cross-modality feature disentagling approach for multi-domain image translation and multiple organ segmentation. Using CT as the labeled source domain, our approach learns to segment multi-modal (T1-weighted and T2-weighted) MRI having no labeled data. Our approach uses a variational auto-encoder (VAE) to disentangle the image content from style. The VAE constrains the style feature encoding to match a universal prior (Gaussian) that is assumed to span the styles of all the source and target modalities. The extracted image style is converted into a latent style scaling code, which modulates the generator to produce multi-modality images according to the target domain code from the image content features. Finally, we introduce a joint distribution matching discriminator that combines the translated images with task-relevant segmentation probability maps to further constrain and regularize image-to-image (I2I) translations. We performed extensive comparisons to multiple state-of-the-art I2I translation and segmentation methods. Our approach resulted in the lowest average multi-domain image reconstruction error of 1.34$\pm$0.04. Our approach produced an average Dice similarity coefficient (DSC) of 0.85 for T1w and 0.90 for T2w MRI for multi-organ segmentation, which was highly comparable to a fully supervised MRI multi-organ segmentation network (DSC of 0.86 for T1w and 0.90 for T2w MRI).
In the last few years, several deep learning models, especially Generative Adversarial Networks have received a lot of attention for the task of Single Image Super-Resolution (SISR). These methods focus on building an end-to-end framework, which produce a high resolution(SR) image from a given low resolution(LR) image in a single step to achieve state-of-the-art performance. This paper focuses on improving an existing deep-learning based method to perform Super-Resolution Microscopy in real-time using a standard GPU. For this, we first propose a tiling strategy, which takes advantage of parallelism provided by a GPU to speed up the network training process. Further, we suggest simple changes to the architecture of the generator and the discriminator of SRGAN. Subsequently, We compare the quality and the running time for the outputs produced by our model, opening its applications in different areas like low-end benchtop and even mobile microscopy. Finally, we explore the possibility of the trained network to produce High-Resolution HR outputs for different domains.
Accurate localization is fundamental to a variety of applications, such as navigation, robotics, autonomous driving, and Augmented Reality (AR). Different from incremental localization, global localization has no drift caused by error accumulation, which is desired in many application scenarios. In addition to GPS used in the open air, 3D maps are also widely used as alternative global localization references. In this paper, we propose a compact 3D map-based global localization system using a low-cost monocular camera and an IMU (Inertial Measurement Unit). The proposed compact map consists of two types of simplified elements with multiple semantic labels, which is well adaptive to various man-made environments like urban environments. Also, semantic edge features are used for the key image-map registration, which is robust against occlusion and long-term appearance changes in the environments. To further improve the localization performance, the key semantic edge alignment is formulated as an optimization problem based on initial poses predicted by an independent VIO (Visual-Inertial Odometry) module. The localization system is realized with modular design in real time. We evaluate the localization accuracy through real-world experimental results compared with ground truth, long-term localization performance is also demonstrated.
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real-world. Concurrently, image completion has aimed to create plausible appearance for the invisible regions, but requires a manual mask as input. In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene. Particularly, we built a system to decompose a scene into individual objects, infer their underlying occlusion relationships, and even automatically learn which parts of the objects are occluded that need to be completed. In order to disentangle the occluded relationships of all objects in a complex scene, we use the fact that the front object without being occluded is easy to be identified, detected, and segmented. Our system interleaves the two tasks of instance segmentation and scene completion through multiple iterations, solving for objects layer-by-layer. We first provide a thorough experiment using a new realistically rendered dataset with ground-truths for all invisible regions. To bridge the domain gap to real imagery where ground-truths are unavailable, we then train another model with the pseudo-ground-truths generated from our trained synthesis model. We demonstrate results on a wide variety of datasets and show significant improvement over the state-of-the-art.
Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on the high-frequency answers (e.g., yellow) ignoring image contents. Existing approaches tackle this problem by creating delicate models or introducing additional visual annotations to reduce question dependency while strengthening image dependency. However, they are still subject to the language prior problem since the data biases have not been even alleviated. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and propose a self-supervised auxiliary task to utilize the balanced data to assist the base VQA model to overcome language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method can significantly outperform the state-of-the-art, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. In other words, we can increase the performance of annotation-based methods by 16% without using external annotations.
With benefits of fast query speed and low storage cost, hashing-based image retrieval approaches have garnered considerable attention from the research community. In this paper, we propose a novel Error-Corrected Deep Cross Modal Hashing (CMH-ECC) method which uses a bitmap specifying the presence of certain facial attributes as an input query to retrieve relevant face images from the database. In this architecture, we generate compact hash codes using an end-to-end deep learning module, which effectively captures the inherent relationships between the face and attribute modality. We also integrate our deep learning module with forward error correction codes to further reduce the distance between different modalities of the same subject. Specifically, the properties of deep hashing and forward error correction codes are exploited to design a cross modal hashing framework with high retrieval performance. Experimental results using two standard datasets with facial attributes-image modalities indicate that our CMH-ECC face image retrieval model outperforms most of the current attribute-based face image retrieval approaches.
In the past years, face recognition technologies have shown impressive recognition performance, mainly due to recent developments in deep convolutional neural networks. Notwithstanding those improvements, several challenges which affect the performance of face recognition systems remain. In this work, we investigate the impact that facial tattoos and paintings have on current face recognition systems. To this end, we first collected an appropriate database containing image-pairs of individuals with and without facial tattoos or paintings. The assembled database was used to evaluate how facial tattoos and paintings affect the detection, quality estimation, as well as the feature extraction and comparison modules of a face recognition system. The impact on these modules was evaluated using state-of-the-art open-source and commercial systems. The obtained results show that facial tattoos and paintings affect all the tested modules, especially for images where a large area of the face is covered with tattoos or paintings. Our work is an initial case-study and indicates a need to design algorithms which are robust to the visual changes caused by facial tattoos and paintings.
X-ray imaging is a widely used technique for non-destructive inspection of agricultural food products. One application of X-ray imaging is the autonomous, in-line detection of foreign objects in food samples. Examples of such inclusions are bone fragments in meat products, plastic and metal debris in fish, fruit infestations. This article presents a processing methodology for unsupervised foreign object detection based on dual-energy X-ray absorptiometry (DEXA). A foreign object is defined as a fragment of material with different X-ray attenuation properties than those belonging to the food product. A novel thickness correction model is introduced as a pre-processing technique for DEXA data. The aim of the model is to homogenize regions in the image that belong to the food product and enhance contrast where the foreign object is present. In this way, the segmentation of the foreign object is more robust to noise and lack of contrast. The proposed methodology was applied to a dataset of 488 samples of meat products. The samples were acquired from a conveyor belt in a food processing factory. Approximately 60\% of the samples contain foreign objects of different types and sizes, while the rest of the samples are void of foreign objects. The results show that samples without foreign objects are correctly identified in 97% of cases, the overall accuracy of foreign object detection reaches 95%.