Ultrasound image degradation in the human body is complex and occurs due to the distortion of the wave as it propagates to and from the target. Here, we establish a simulation based framework that deconstructs the sources of image degradation into a separable parameter space that includes phase aberration from speed variation, multiple reverberations, and trailing reverberation. These separable parameters are then used to reconstruct images with known and independently modulable amounts of degradation using methods that depend on the additive or multiplicative nature of the degradation. Experimental measurements and Fullwave simulations in the human abdomen demonstrate this calibrated process in abdominal imaging by matching relevant imaging metrics such as phase aberration, reverberation strength, speckle brightness and coherence length. Applications of the reconstruction technique are illustrated for beamforming strategies (phase aberration correction, spatial coherence imaging), in a standard abdominal environment, as well as in impedance ranges much higher than those naturally occurring in the body.
Renovating the memories in old photos is an intriguing research topic in computer vision fields. These legacy images often suffer from severe and commingled degradations such as cracks, noise, and color-fading, while lack of large-scale paired old photo datasets makes this restoration task very challenging. In this work, we present a novel reference-based end-to-end learning framework that can jointly repair and colorize the degraded legacy pictures. Specifically, the proposed framework consists of three modules: a restoration sub-network for degradation restoration, a similarity sub-network for color histogram matching and transfer, and a colorization subnet that learns to predict the chroma elements of the images conditioned on chromatic reference signals. The whole system takes advantage of the color histogram priors in a given reference image, which vastly reduces the dependency on large-scale training data. Apart from the proposed method, we also create, to our knowledge, the first public and real-world old photo dataset with paired ground truth for evaluating old photo restoration models, wherein each old photo is paired with a manually restored pristine image by PhotoShop experts. Our extensive experiments conducted on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-arts both quantitatively and qualitatively.
We investigate the optimal choice of replacement layer for Batch Normalisation (BN) in residual networks (ResNets) for training with Differentially Private Stochastic Gradient Descent (DP-SGD) and study the phenomenon of scale mixing in residual blocks, whereby the activations on the two branches are scaled differently. Our experimental evaluation indicates that a hyperparameter search over 1-64 Group Normalisation (GN) groups improves the accuracy of ResNet-9 and ResNet-50 considerably in both benchmark (CIFAR-10) and large-image (ImageNette) tasks. Moreover, Scale Normalisation, a simple modification to the model architecture by which an additional normalisation layer is introduced after the residual block's addition operation further improves the utility of ResNets allowing us to achieve state-of-the-art results on CIFAR-10.
We present the DeepHist - a novel Deep Learning framework for augmenting a network by histogram layers and demonstrate its strength by addressing image-to-image translation problems. Specifically, given an input image and a reference color distribution we aim to generate an output image with the structural appearance (content) of the input (source) yet with the colors of the reference. The key idea is a new technique for a differentiable construction of joint and color histograms of the output images. We further define a color distribution loss based on the Earth Mover's Distance between the output's and the reference's color histograms and a Mutual Information loss based on the joint histograms of the source and the output images. Promising results are shown for the tasks of color transfer, image colorization and edges $\rightarrow$ photo, where the color distribution of the output image is controlled. Comparison to Pix2Pix and CyclyGANs are shown.
Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition system on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization, motion estimation, and two regularizers, i.e., adversarial learning, color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. With these, by jointly learning the frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the required spatio-temporal information useful for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. When comparing to baseline methods that map video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves comparable performance with the state-of-the-art methods with less computational cost.
Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient pixel-level annotations, which is time-consuming and expensive. To enrich the label resources of LUAD and to alleviate the annotation efforts, we organize this challenge WSSS4LUAD to call for the outstanding weakly-supervised semantic segmentation (WSSS) techniques for histopathology images of LUAD. Participants have to design the algorithm to segment tumor epithelial, tumor-associated stroma and normal tissue with only patch-level labels. This challenge includes 10,091 patch-level annotations (the training set) and over 130 million labeled pixels (the validation and test sets), from 87 WSIs (67 from GDPH, 20 from TCGA). All the labels were generated by a pathologist-in-the-loop pipeline with the help of AI models and checked by the label review board. Among 532 registrations, 28 teams submitted the results in the test phase with over 1,000 submissions. Finally, the first place team achieved mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919). According to the technical reports of the top-tier teams, CAM is still the most popular approach in WSSS. Cutmix data augmentation has been widely adopted to generate more reliable samples. With the success of this challenge, we believe that WSSS approaches with patch-level annotations can be a complement to the traditional pixel annotations while reducing the annotation efforts. The entire dataset has been released to encourage more researches on computational pathology in LUAD and more novel WSSS techniques.
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training. In this paper, we present a simple but effective temporal transductive inference (TTI) approach that leverages temporal consistency in the unlabelled video frames during few-shot inference. Key to our approach is the use of both global and local temporal constraints. The objective of the global constraint is to learn consistent linear classifiers for novel classes across the image sequence, whereas the local constraint enforces the proportion of foreground/background regions in each frame to be coherent across a local temporal window. These constraints act as spatiotemporal regularizers during the transductive inference to increase temporal coherence and reduce overfitting on the few-shot support set. Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%. In addition, we introduce improved benchmarks that are exhaustively labelled (i.e. all object occurrences are labelled, unlike the currently available), and present a more realistic evaluation paradigm that targets data distribution shift between training and testing sets. Our empirical results and in-depth analysis confirm the added benefits of the proposed spatiotemporal regularizers to improve temporal coherence and overcome certain overfitting scenarios.
Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.
Automatically generating medical reports for retinal images is one of the promising ways to help ophthalmologists reduce their workload and improve work efficiency. In this work, we propose a new context-driven encoding network to automatically generate medical reports for retinal images. The proposed model is mainly composed of a multi-modal input encoder and a fused-feature decoder. Our experimental results show that our proposed method is capable of effectively leveraging the interactive information between the input image and context, i.e., keywords in our case. The proposed method creates more accurate and meaningful reports for retinal images than baseline models and achieves state-of-the-art performance. This performance is shown in several commonly used metrics for the medical report generation task: BLEU-avg (+16%), CIDEr (+10.2%), and ROUGE (+8.6%).
Face images are rich data items that are useful and can easily be collected in many applications, such as in 1-to-1 face verification tasks in the domain of security and surveillance systems. Multiple methods have been proposed to protect an individual's privacy by perturbing the images to remove traces of identifiable information, such as gender or race. However, significantly less attention has been given to the problem of protecting images while maintaining optimal task utility. In this paper, we study the novel problem of creating privacy-preserving image representations with respect to a given utility task by proposing a principled framework called the Adversarial Image Anonymizer (AIA). AIA first creates an image representation using a generative model, then enhances the learned image representations using adversarial learning to preserve privacy and utility for a given task. Experiments were conducted on a publicly available data set to demonstrate the effectiveness of AIA as a privacy-preserving mechanism for face images.