Medical vision-language models enable co-learning and integrating features from medical imaging and clinical text. However, these models are not easy to train and the latent representation space can be complex. Here we propose a novel way for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches and reduces the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\% of the data.
With the increasing application of robots, stable and efficient Visual Odometry (VO) algorithms are becoming more and more important. Based on the Fourier Mellin Transformation (FMT) algorithm, the extended Fourier Mellin Transformation (eFMT) is an image registration approach that can be applied to downward-looking cameras, for example on aerial and underwater vehicles. eFMT extends FMT to multi-depth scenes and thus more application scenarios. It is a visual odometry method which estimates the pose transformation between three overlapping images. On this basis, we develop an optimized eFMT algorithm that improves certain aspects of the method and combines it with back-end optimization for the small loop of three consecutive frames. For this we investigate the extraction of uncertainty information from the eFMT registration, the related objective function and the graph-based optimization. Finally, we design a series of experiments to investigate the properties of this approach and compare it with other VO and SLAM (Simultaneous Localization and Mapping) algorithms. The results show the superior accuracy and speed of our o-eFMT approach, which is published as open source.
For a manifold embedded in an inner product space, we express geometric quantities such as {\it Hamilton vector fields, affine and Levi-Civita connections, curvature} in global coordinates. Instead of coordinate indices, the global formulas for most quantities are expressed as {\it operator-valued} expressions, using an {\it affine projection} to the tangent bundle. For a submersion image of an embedded manifold, we introduce {\it liftings} of Hamilton vector fields, allowing us to use embedded coordinates on horizontal bundles. We derive a {\it Gauss-Codazzi equation} for affine connections on vector bundles. This approach allows us to evaluate geometric expressions globally, and could be used effectively with modern numerical frameworks in applications. Examples considered include rigid body mechanics and Hamilton mechanics on Grassmann manifolds. We show explicitly the cross-curvature (MTW-tensor) for the {\it Kim-McCann} metric with a reflector antenna-type cost function on the space of positive-semidefinite matrices of fixed rank has nonnegative cross-curvature, while the corresponding cost could have negative cross-curvature on Grassmann manifolds, except for projective spaces.
Widefield microscopy is widely used for non-invasive imaging of biological structures at subcellular resolution. When applied to complex specimen, its image quality is degraded by sample-induced optical aberration. Adaptive optics can correct wavefront distortion and restore diffraction-limited resolution but require wavefront sensing and corrective devices, increasing system complexity and cost. Here, we describe a self-supervised machine learning algorithm, CoCoA, that performs joint wavefront estimation and three-dimensional structural information extraction from a single input 3D image stack without the need for external training dataset. We implemented CoCoA for widefield imaging of mouse brain tissues and validated its performance with direct-wavefront-sensing-based adaptive optics. Importantly, we systematically explored and quantitatively characterized the limiting factors of CoCoA's performance. Using CoCoA, we demonstrated the first in vivo widefield mouse brain imaging using machine-learning-based adaptive optics. Incorporating coordinate-based neural representations and a forward physics model, the self-supervised scheme of CoCoA should be applicable to microscopy modalities in general.
Although recent personalization methods have democratized high-resolution image synthesis by enabling swift concept acquisition with minimal examples and lightweight computation, they also present an exploitable avenue for high accessible backdoor attacks. This paper investigates a critical and unexplored aspect of text-to-image (T2I) diffusion models - their potential vulnerability to backdoor attacks via personalization. Our study focuses on a zero-day backdoor vulnerability prevalent in two families of personalization methods, epitomized by Textual Inversion and DreamBooth.Compared to traditional backdoor attacks, our proposed method can facilitate more precise, efficient, and easily accessible attacks with a lower barrier to entry. We provide a comprehensive review of personalization in T2I diffusion models, highlighting the operation and exploitation potential of this backdoor vulnerability. To be specific, by studying the prompt processing of Textual Inversion and DreamBooth, we have devised dedicated backdoor attacks according to the different ways of dealing with unseen tokens and analyzed the influence of triggers and concept images on the attack effect. Our empirical study has shown that the nouveau-token backdoor attack has better attack performance while legacy-token backdoor attack is potentially harder to defend.
Immunohistochemistry (IHC) biomarkers are essential tools for reliable cancer diagnosis and subtyping. It requires cross-staining comparison among Whole Slide Images (WSIs) of IHCs and hematoxylin and eosin (H&E) slides. Currently, pathologists examine the visually co-localized areas across IHC and H&E glass slides for a final diagnosis, which is a tedious and challenging task. Moreover, visually inspecting different IHC slides back and forth to analyze local co-expressions is inherently subjective and prone to error, even when carried out by experienced pathologists. Relying on digital pathology, we propose Composite Biomarker Image (CBI) in this work. CBI is a single image that can be composed using different filtered IHC biomarker images for better visualization. We present a CBI image produced in two steps by the proposed solution for better visualization and hence more efficient clinical workflow. In the first step, IHC biomarker images are aligned with the H&E images using one coordinate system and orientation. In the second step, the positive or negative IHC regions from each biomarker image (based on the pathologists recommendation) are filtered and combined into one image using a fuzzy inference system. For evaluation, the resulting CBI images, from the proposed system, were evaluated qualitatively by the expert pathologists. The CBI concept helps the pathologists to identify the suspected target tissues more easily, which could be further assessed by examining the actual WSIs at the same suspected regions.
Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and deidentified. Facilitating data sharing and harmonization for multi-center collaborations are particularly difficult when protected health information (PHI) is contained or embedded in image meta-data. We propose a novel library in the Python framework, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because of the frequent inclusion of PHI directly on the images. PyLogik processes the image volumes through a series of text detection/extraction, filtering, thresholding, morphological and contour comparisons. This methodology deidentifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in the identification of regions of interest (ROI), a random sample of 50 cardiac ultrasounds (echocardiograms) were processed through PyLogik, and the outputs were compared with the manual segmentations by an expert user. The Dice coefficient of the two approaches achieved an average value of 0.976. Next, an investigation was conducted to ascertain the degree of information compression achieved using the algorithm. Resultant data was found to be on average approximately 72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for ultrasound data cleaning and deidentification, determining ROI, and file compression which will facilitate efficient storage, use, and dissemination of ultrasound data.
Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have exhibited overwhelming results with high fidelity. Despite the magnificent progress, current state-of-the-art models still struggle to generate images fully adhering to the input prompt. Prior work, Attend & Excite, has introduced the concept of Generative Semantic Nursing (GSN), aiming to optimize cross-attention during inference time to better incorporate the semantics. It demonstrates promising results in generating simple prompts, e.g., ``a cat and a dog''. However, its efficacy declines when dealing with more complex prompts, and it does not explicitly address the problem of improper attribute binding. To address the challenges posed by complex prompts or scenarios involving multiple entities and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: a novel attendance loss and a binding loss. Our approach stands out in its ability to faithfully synthesize desired objects with improved attribute alignment from complex prompts and exhibits superior performance across multiple evaluation benchmarks. More videos and updates can be found on the project page \url{https://sites.google.com/view/divide-and-bind}.
Cross-silo federated learning (FL) enables the development of machine learning models on datasets distributed across data centers such as hospitals and clinical research laboratories. However, recent research has found that current FL algorithms face a trade-off between local and global performance when confronted with distribution shifts. Specifically, personalized FL methods have a tendency to overfit to local data, leading to a sharp valley in the local model and inhibiting its ability to generalize to out-of-distribution data. In this paper, we propose a novel federated model soup method (i.e., selective interpolation of model parameters) to optimize the trade-off between local and global performance. Specifically, during the federated training phase, each client maintains its own global model pool by monitoring the performance of the interpolated model between the local and global models. This allows us to alleviate overfitting and seek flat minima, which can significantly improve the model's generalization performance. We evaluate our method on retinal and pathological image classification tasks, and our proposed method achieves significant improvements for out-of-distribution generalization. Our code is available at https://github.com/ubc-tea/FedSoup.
Current object detection models have achieved good results on many benchmark datasets, detecting objects in dark conditions remains a large challenge. To address this issue, we propose a pyramid enhanced network (PENet) and joint it with YOLOv3 to build a dark object detection framework named PE-YOLO. Firstly, PENet decomposes the image into four components of different resolutions using the Laplacian pyramid. Specifically we propose a detail processing module (DPM) to enhance the detail of images, which consists of context branch and edge branch. In addition, we propose a low-frequency enhancement filter (LEF) to capture low-frequency semantics and prevent high-frequency noise. PE-YOLO adopts an end-to-end joint training approach and only uses normal detection loss to simplify the training process. We conduct experiments on the low-light object detection dataset ExDark to demonstrate the effectiveness of ours. The results indicate that compared with other dark detectors and low-light enhancement models, PE-YOLO achieves the advanced results, achieving 78.0% in mAP and 53.6 in FPS, respectively, which can adapt to object detection under different low-light conditions. The code is available at https://github.com/XiangchenYin/PE-YOLO.