With the development of deep learning, medical image classification has been significantly improved. However, deep learning requires massive data with labels. While labeling the samples by human experts is expensive and time-consuming, collecting labels from crowd-sourcing suffers from the noises which may degenerate the accuracy of classifiers. Therefore, approaches that can effectively handle label noises are highly desired. Unfortunately, recent progress on handling label noise in deep learning has gone largely unnoticed by the medical image. To fill the gap, this paper proposes a noise-tolerant medical image classification framework named Co-Correcting, which significantly improves classification accuracy and obtains more accurate labels through dual-network mutual learning, label probability estimation, and curriculum label correcting. On two representative medical image datasets and the MNIST dataset, we test six latest Learning-with-Noisy-Labels methods and conduct comparative studies. The experiments show that Co-Correcting achieves the best accuracy and generalization under different noise ratios in various tasks. Our project can be found at: https://github.com/JiarunLiu/Co-Correcting.
In the field of waste copper granules recycling, engineers should be able to identify all different sorts of impurities in waste copper granules and estimate their mass proportion relying on experience before rating. This manual rating method is costly, lacking in objectivity and comprehensiveness. To tackle this problem, we propose a waste copper granules rating system based on machine vision and deep learning. We firstly formulate the rating task into a 2D image recognition and purity regression task. Then we design a two-stage convolutional rating network to compute the mass purity and rating level of waste copper granules. Our rating network includes a segmentation network and a purity regression network, which respectively calculate the semantic segmentation heatmaps and purity results of the waste copper granules. After training the rating network on the augmented datasets, experiments on real waste copper granules demonstrate the effectiveness and superiority of the proposed network. Specifically, our system is superior to the manual method in terms of accuracy, effectiveness, robustness, and objectivity.
We propose Mask CycleGAN, a novel architecture for unpaired image domain translation built based on CycleGAN, with an aim to address two issues: 1) unimodality in image translation and 2) lack of interpretability of latent variables. Our innovation in the technical approach is comprised of three key components: masking scheme, generator and objective. Experimental results demonstrate that this architecture is capable of bringing variations to generated images in a controllable manner and is reasonably robust to different masks.
We propose an algorithm to explore the global optimization method, using SAT solvers, for training a neural net. Deep Neural Networks have achieved great feats in tasks like-image recognition, speech recognition, etc. Much of their success can be attributed to the gradient-based optimisation methods, which scale well to huge datasets while still giving solutions, better than any other existing methods. However, there exist learning problems like the parity function and the Fast Fourier Transform, where a neural network using gradient-based optimisation algorithm can not capture the underlying structure of the learning task properly. Thus, exploring global optimisation methods is of utmost interest as the gradient-based methods get stuck in local optima. In the experiments, we demonstrate the effectiveness of our algorithm against the ADAM optimiser in certain tasks like parity learning. However, in the case of image classification on the MNIST Dataset, the performance of our algorithm was less than satisfactory. We further discuss the role of the size of the training dataset and the hyper-parameter settings in keeping things scalable for a SAT solver.
Recent research in disaster informatics demonstrates a practical and important use case of artificial intelligence to save human lives and sufferings during post-natural disasters based on social media contents (text and images). While notable progress has been made using texts, research on exploiting the images remains relatively under-explored. To advance the image-based approach, we propose MEDIC (available at: https://crisisnlp.qcri.org/medic/index.html), which is the largest social media image classification dataset for humanitarian response consisting of 71,198 images to address four different tasks in a multi-task learning setup. This is the first dataset of its kind: social media image, disaster response, and multi-task learning research. An important property of this dataset is its high potential to contribute research on multi-task learning, which recently receives much interest from the machine learning community and has shown remarkable results in terms of memory, inference speed, performance, and generalization capability. Therefore, the proposed dataset is an important resource for advancing image-based disaster management and multi-task machine learning research.
Recently, deep learning-based image enhancement algorithms achieved state-of-the-art (SOTA) performance on several publicly available datasets. However, most existing methods fail to meet practical requirements either for visual perception or for computation efficiency, especially for high-resolution images. In this paper, we propose a novel real-time image enhancer via learnable spatial-aware 3-dimentional lookup tables(3D LUTs), which well considers global scenario and local spatial information. Specifically, we introduce a light weight two-head weight predictor that has two outputs. One is a 1D weight vector used for image-level scenario adaptation, the other is a 3D weight map aimed for pixel-wise category fusion. We learn the spatial-aware 3D LUTs and fuse them according to the aforementioned weights in an end-to-end manner. The fused LUT is then used to transform the source image into the target tone in an efficient way. Extensive results show that our model outperforms SOTA image enhancement methods on public datasets both subjectively and objectively, and that our model only takes about 4ms to process a 4K resolution image on one NVIDIA V100 GPU.
Recent advances in deep learning have led to superhuman performance across a variety of applications. Recently, these methods have been successfully employed to improve the rate-distortion performance in the task of image compression. However, current methods either use additional post-processing blocks on the decoder end to improve compression or propose an end-to-end compression scheme based on heuristics. For the majority of these, the trained deep neural networks (DNNs) are not compatible with standard encoders and would be difficult to deply on personal computers and cellphones. In light of this, we propose a system that learns to improve the encoding performance by enhancing its internal neural representations on both the encoder and decoder ends, an approach we call Neural JPEG. We propose frequency domain pre-editing and post-editing methods to optimize the distribution of the DCT coefficients at both encoder and decoder ends in order to improve the standard compression (JPEG) method. Moreover, we design and integrate a scheme for jointly learning quantization tables within this hybrid neural compression framework.Experiments demonstrate that our approach successfully improves the rate-distortion performance over JPEG across various quality metrics, such as PSNR and MS-SSIM, and generates visually appealing images with better color retention quality.
State-of-the-art methods for semantic segmentation of images involve computationally intensive neural network architectures. Most of these methods are not adaptable to high-resolution image segmentation due to memory and other computational issues. Typical approaches in literature involve design of neural network architectures that can fuse global information from low-resolution images and local information from the high-resolution counterparts. However, architectures designed for processing high resolution images are unnecessarily complex and involve a lot of hyper parameters that can be difficult to tune. Also, most of these architectures require ground truth annotations of the high resolution images to train, which can be hard to obtain. In this article, we develop a robust pipeline based on mathematical morphological (MM) operators that can seamlessly extend any existing semantic segmentation algorithm to high resolution images. Our method does not require the ground truth annotations of the high resolution images. It is based on efficiently utilizing information from the low-resolution counterparts, and gradient information on the high-resolution images. We obtain high quality seeds from the inferred labels on low-resolution images using traditional morphological operators and propagate seed labels using a random walker to refine the semantic labels at the boundaries. We show that the semantic segmentation results obtained by our method beat the existing state-of-the-art algorithms on high-resolution images. We empirically prove the robustness of our approach to the hyper parameters used in our pipeline. Further, we characterize some necessary conditions under which our pipeline is applicable and provide an in-depth analysis of the proposed approach.
Recently, animal pose estimation is attracting increasing interest from the academia (e.g., wildlife and conservation biology) focusing on animal behavior understanding. However, currently animal pose estimation suffers from small datasets and large data variances, making it difficult to obtain robust performance. To tackle this problem, we propose that the rich knowledge about relations between pose-related semantics learned by language models can be utilized to improve the animal pose estimation. Therefore, in this study, we introduce a novel PromptPose framework to effectively apply language models for better understanding the animal poses based on prompt training. In PromptPose, we propose that adapting the language knowledge to the visual animal poses is key to achieve effective animal pose estimation. To this end, we first introduce textual prompts to build connections between textual semantic descriptions and supporting animal keypoint features. Moreover, we further devise a pixel-level contrastive loss to build dense connections between textual descriptions and local image features, as well as a semantic-level contrastive loss to bridge the gap between global contrasts in language-image cross-modal pre-training and local contrasts in dense prediction. In practice, the PromptPose has shown great benefits for improving animal pose estimation. By conducting extensive experiments, we show that our PromptPose achieves superior performance under both supervised and few-shot settings, outperforming representative methods by a large margin. The source code and models will be made publicly available.
This paper presents a new efficient black-box attribution method based on Hilbert-Schmidt Independence Criterion (HSIC), a dependence measure based on Reproducing Kernel Hilbert Spaces (RKHS). HSIC measures the dependence between regions of an input image and the output of a model based on kernel embeddings of distributions. It thus provides explanations enriched by RKHS representation capabilities. HSIC can be estimated very efficiently, significantly reducing the computational cost compared to other black-box attribution methods. Our experiments show that HSIC is up to 8 times faster than the previous best black-box attribution methods while being as faithful. Indeed, we improve or match the state-of-the-art of both black-box and white-box attribution methods for several fidelity metrics on Imagenet with various recent model architectures. Importantly, we show that these advances can be transposed to efficiently and faithfully explain object detection models such as YOLOv4. Finally, we extend the traditional attribution methods by proposing a new kernel enabling an orthogonal decomposition of importance scores based on HSIC, allowing us to evaluate not only the importance of each image patch but also the importance of their pairwise interactions.