Coronavirus Disease 2019 (COVID-19) has spread aggressively across the world causing an existential health crisis. Thus, having a system that automatically detects COVID-19 in tomography (CT) images can assist in quantifying the severity of the illness. Unfortunately, labelling chest CT scans requires significant domain expertise, time, and effort. We address these labelling challenges by only requiring point annotations, a single pixel for each infected region on a CT image. This labeling scheme allows annotators to label a pixel in a likely infected region, only taking 1-3 seconds, as opposed to 10-15 seconds to segment a region. Conventionally, segmentation models train on point-level annotations using the cross-entropy loss function on these labels. However, these models often suffer from low precision. Thus, we propose a consistency-based (CB) loss function that encourages the output predictions to be consistent with spatial transformations of the input images. The experiments on 3 open-source COVID-19 datasets show that this loss function yields significant improvement over conventional point-level loss functions and almost matches the performance of models trained with full supervision with much less human effort. Code is available at: \url{https://github.com/IssamLaradji/covid19_weak_supervision}.
Learning to infer graph representations and performing spatial reasoning in a complex surgical environment can play a vital role in surgical scene understanding in robotic surgery. For this purpose, we develop an approach to generate the scene graph and predict surgical interactions between instruments and surgical region of interest (ROI) during robot-assisted surgery. We design an attention link function and integrate with a graph parsing network to recognize the surgical interactions. To embed each node with corresponding neighbouring node features, we further incorporate SageConv into the network. The scene graph generation and active edge classification mostly depend on the embedding or feature extraction of node and edge features from complex image representation. Here, we empirically demonstrate the feature extraction methods by employing label smoothing weighted loss. Smoothing the hard label can avoid the over-confident prediction of the model and enhances the feature representation learned by the penultimate layer. To obtain the graph scene label, we annotate the bounding box and the instrument-ROI interactions on the robotic scene segmentation challenge 2018 dataset with an experienced clinical expert in robotic surgery and employ it to evaluate our propositions.
We propose a framework that estimates inundation depth (maximum water level) and debris-flow-induced topographic deformation from remote sensing imagery by integrating deep learning and numerical simulation. A water and debris flow simulator generates training data for various artificial disaster scenarios. We show that regression models based on Attention U-Net and LinkNet architectures trained on such synthetic data can predict the maximum water level and topographic deformation from a remote sensing-derived change detection map and a digital elevation model. The proposed framework has an inpainting capability, thus mitigating the false negatives that are inevitable in remote sensing image analysis. Our framework breaks the limits of remote sensing and enables rapid estimation of inundation depth and topographic deformation, essential information for emergency response, including rescue and relief activities. We conduct experiments with both synthetic and real data for two disaster events that caused simultaneous flooding and debris flows and demonstrate the effectiveness of our approach quantitatively and qualitatively.
Although fiducial markers give an accurate pose estimation in laboratory conditions, where the noisy factors are controlled, using them in field robotic applications remains a challenge. This is constrained to the fiducial maker systems, since they only work within the RGB image space. As a result, noises in the image produce large pose estimation errors. In robotic applications, fiducial markers have been mainly used in its original and simple form, as a plane in a printed paper sheet. This setup is sufficient for basic visual servoing and augmented reality applications, but not for complex swarm robotic applications in which the setup consists of multiple dynamic markers (tags displayed on LCD screen). This paper describes a novel methodology, called AprilTags3D, that improves pose estimation accuracy of AprilTags in field robotics with only RGB sensor by adding a third dimension to the marker detector. Also, presents experimental results from applying the proposed methodology to swarm autonomous robotic boats for latching between them and for creating robotic formations.
Recovering the 3D shape of an object from single or multiple images with deep neural networks has been attracting increasing attention in the past few years. Mainstream works (e.g. 3D-R2N2) use recurrent neural networks (RNNs) to sequentially fuse feature maps of input images. However, RNN-based approaches are unable to produce consistent reconstruction results when given the same input images with different orders. Moreover, RNNs may forget important features from early input images due to long-term memory loss. To address these issues, we propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume. To further correct the wrongly recovered parts in the fused 3D volume, a refiner is adopted to generate the final output. Experimental results on the ShapeNet, Pix3D, and Things3D benchmarks show that Pix2Vox++ performs favorably against state-of-the-art methods in terms of both accuracy and efficiency.
Learning Enabled Components (LECs) are widely being used in a variety of perception based autonomy tasks like image segmentation, object detection, end-to-end driving, etc. These components are trained with large image datasets with multimodal factors like weather conditions, time-of-day, traffic-density, etc. The LECs learn from these factors during training, and while testing if there is variation in any of these factors, the components get confused resulting in low confidence predictions. The images with factors not seen during training is commonly referred to as Out-of-Distribution (OOD). For safe autonomy it is important to identify the OOD images, so that a suitable mitigation strategy can be performed. Classical one-class classifiers like SVM and SVDD are used to perform OOD detection. However, the multiple labels attached to the images in these datasets, restricts the direct application of these techniques. We address this problem using the latent space of the $\beta$-Variational Autoencoder ($\beta$-VAE). We use the fact that compact latent space generated by an appropriately selected $\beta$-VAE will encode the information about these factors in a few latent variables, and that can be used for computationally inexpensive detection. We evaluate our approach on the nuScenes dataset, and our results shows the latent space of $\beta$-VAE is sensitive to encode changes in the values of the generative factor.
Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises, giving way to methods that can adapt to existing noise without knowing its ground truth. Previous image-based method leads to noise overfitting if directly applied to video denoisers, and has inadequate temporal information management especially in terms of occlusion and lighting variation, which considerably hinders its denoising performance. In this paper, we propose a general framework for video denoising networks that successfully addresses these challenges. A novel twin sampler assembles training data by decoupling inputs from targets without altering semantics, which not only effectively solves the noise overfitting problem, but also generates better occlusion masks efficiently by checking optical flow consistency. An online denoising scheme and a warping loss regularizer are employed for better temporal alignment. Lighting variation is quantified based on the local similarity of aligned frames. Our method consistently outperforms the prior art by 0.6-3.2dB PSNR on multiple noises, datasets and network architectures. State-of-the-art results on reducing model-blind video noises are achieved. Extensive ablation studies are conducted to demonstrate the significance of each technical components.
The collection of large datasets allows for advanced analytics that can lead to improved quality of life and progress in applications such as machine cognition and medical analysis. However, recently there has been an increased pressure to guarantee the privacy of users when collecting data. In this work, we study how adversarial representation learning can be used to ensure the privacy of users, and to obfuscate sensitive attributes in existing datasets. While previous methods using adversarial representation learning for privacy only aims at obfuscating the sensitive information, we find that adding new information in its place can improve the strength of the provided privacy. We propose a method building on generative adversarial networks that has two steps in the data privatization. In the first step, sensitive data is removed from the representation. In the second step, a sample which is independent of the input data is inserted in its place. The result is an approach that can provide stronger privatization on image data, and yet be preserving both the domain and the utility of the inputs.
Retinopathy of Prematurity (ROP) is an eye disorder primarily affecting premature infants with lower weights. It causes proliferation of vessels in the retina and could result in vision loss and, eventually, retinal detachment, leading to blindness. While human experts can easily identify severe stages of ROP, the diagnosis of earlier stages, which are the most relevant to determining treatment choice, are much more affected by variability in subjective interpretations of human experts. In recent years, there has been a significant effort to automate the diagnosis using deep learning. This paper builds upon the success of previous models and develops a novel architecture, which combines object segmentation and convolutional neural networks (CNN) to construct an effective classifier of ROP stages 1-3 based on neonatal retinal images. Motivated by the fact that the formation and shape of a demarcation line in the retina is the distinguishing feature between earlier ROP stages, our proposed system first trains an object segmentation model to identify the demarcation line at a pixel level and adds the resulting mask as an additional "color" channel in the original image. Then, the system trains a CNN classifier based on the processed images to leverage information from both the original image and the mask, which helps direct the model's attention to the demarcation line. In a number of careful experiments comparing its performance to previous object segmentation systems and CNN-only systems trained on our dataset, our novel architecture significantly outperforms previous systems in accuracy, demonstrating the effectiveness of our proposed pipeline.
3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the availability of multiple facial images (sometimes from the same subject) as input, and must address a number of methodological challenges such as establishing dense correspondences across large facial poses, expressions, and non-uniform illumination. In general these methods require complex and inefficient pipelines for model building and fitting. In this work, we propose to address many of these limitations by training a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. Our CNN works with just a single 2D facial image, does not require accurate alignment nor establishes dense correspondence between images, works for arbitrary facial poses and expressions, and can be used to reconstruct the whole 3D facial geometry (including the non-visible parts of the face) bypassing the construction (during training) and fitting (during testing) of a 3D Morphable Model. We achieve this via a simple CNN architecture that performs direct regression of a volumetric representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related task of facial landmark localization can be incorporated into the proposed framework and help improve reconstruction quality, especially for the cases of large poses and facial expressions. Testing code will be made available online, along with pre-trained models http://aaronsplace.co.uk/papers/jackson2017recon