Recent image-to-image translation work has shifted from supervised to unsupervised settings because capturing or labeling large amounts of paired data is expensive. However, current unsupervised methods that rely on the cycle-consistency constraint may not find the desired mapping, especially for difficult translation tasks. On the other hand, a small amount of paired data is usually accessible. We therefore introduce a general framework for semi-supervised image translation. Unlike previous works, our main idea is to learn the translation over the latent feature space instead of the image space. Because the feature space is low-dimensional, the desired mapping function is easier to find, which improves both the quality of the translation results and the stability of the translation model. Empirically, we show that feature translation generates better results even with only a small amount of paired data. Experimental comparisons with state-of-the-art approaches demonstrate the effectiveness of the proposed framework on a variety of challenging image-to-image translation tasks.
We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
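The late-fusion dual-encoder retrieval described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional toy embeddings, the additive fusion in the hypothetical `fuse_query` helper, and the image names. It is not TASK-former's actual architecture or fusion rule; it only shows why the image index can be built without seeing any query.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse_query(text_emb, sketch_emb):
    """Hypothetical late fusion (an assumption): add the embeddings, renormalize."""
    return normalize([t + s for t, s in zip(text_emb, sketch_emb)])

def retrieve(query_emb, index, k=1):
    """Rank pre-computed image embeddings by cosine similarity to the query."""
    scored = [(sum(q * x for q, x in zip(query_emb, emb)), name)
              for name, emb in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Image embeddings are computed once, offline, independently of any query.
index = {
    "cat_on_sofa": normalize([0.9, 0.1, 0.0]),
    "dog_in_park": normalize([0.1, 0.9, 0.2]),
    "red_car": normalize([0.0, 0.2, 0.9]),
}

# Toy stand-ins for the outputs of a text encoder and a sketch encoder.
text_emb = normalize([0.5, 0.5, 0.0])    # ambiguous text query
sketch_emb = normalize([0.1, 0.8, 0.1])  # sketch closer to the dog image
top = retrieve(fuse_query(text_emb, sketch_emb), index, k=1)
```

Because the query is reduced to a single vector before it meets the index, the same pre-computed index serves text-only, sketch-only, or fused queries.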
Machine learning is being widely applied to analyze satellite data for problems such as classification and feature detection. Unlike traditional image processing algorithms, geospatial applications need to convert the detected objects from raster form to geospatial vector form for further analysis. This gem delivers a simple and lightweight algorithm for delineating the pixels marked by ML algorithms to extract geospatial objects from satellite images. The proposed algorithm is exact, and users can further apply simplification and approximation based on application needs.
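One common way to delineate marked pixels exactly is to cancel the edges shared by neighboring marked pixels and chain what remains into rings. The sketch below assumes that approach; the abstract does not state which algorithm the authors use, so `boundary_rings` is a hypothetical helper and not their method. Coordinates are pixel corners in raster space; mapping them to geospatial coordinates would additionally need the raster's georeferencing transform, which is not modeled here.

```python
def boundary_rings(mask):
    """Trace the exact outline of marked pixels as closed rings of corner points.

    mask is a list of rows of 0/1 values; pixel (r, c) covers the unit square
    with corners (c, r) and (c + 1, r + 1) in raster coordinates.
    """
    edges = set()
    for r, row in enumerate(mask):
        for c, marked in enumerate(row):
            if not marked:
                continue
            corners = [(c, r), (c + 1, r), (c + 1, r + 1), (c, r + 1)]
            for i in range(4):
                a, b = corners[i], corners[(i + 1) % 4]
                # An edge shared by two marked pixels appears once in each
                # direction, so the pair cancels and only the outline remains.
                if (b, a) in edges:
                    edges.remove((b, a))
                else:
                    edges.add((a, b))

    # Chain the surviving directed edges head-to-tail into closed rings.
    next_of = {}
    for a, b in edges:
        next_of.setdefault(a, []).append(b)
    rings = []
    while next_of:
        start = next(iter(next_of))
        ring, cur = [start], start
        while True:
            nxt = next_of[cur].pop()
            if not next_of[cur]:
                del next_of[cur]
            if nxt == start:
                break
            ring.append(nxt)
            cur = nxt
        rings.append(ring)
    return rings

# An L-shaped region of three marked pixels.
rings = boundary_rings([[1, 1], [0, 1]])
```

The rings keep every unit-length step, so collinear points are still present; dropping them, or applying a tolerance-based simplification, is the kind of optional post-processing the abstract leaves to the user.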
Deep learning methods have achieved great success in solving computer vision tasks, and they have been widely utilized in artificially intelligent systems for image processing, analysis, and understanding. However, deep neural networks have been shown to be vulnerable to adversarial perturbations in input data. The security issues of deep neural networks have thus come to the fore. It is imperative to study the adversarial robustness of deep vision algorithms comprehensively. This talk focuses on the adversarial robustness of image classification models and image denoisers. We will discuss the robustness of deep vision algorithms from three perspectives: 1) robustness evaluation (we propose the ObsAtk to evaluate the robustness of denoisers), 2) robustness improvement (HAT, TisODE, and CIFS are developed to robustify vision models), and 3) the connection between adversarial robustness and generalization capability to new domains (we find that adversarially robust denoisers can deal with unseen types of real-world noise).
In this study, a novel multiple-frame-based, image- and texture-independent convolutional neural network (CNN) noise estimator is introduced. The estimator works.
Advances in computer vision and machine learning techniques have led to significant development in 2D and 3D human pose estimation from RGB cameras, LiDAR, and radars. However, human pose estimation from images is adversely affected by occlusion and lighting, which are common in many scenarios of interest. Radar and LiDAR technologies, on the other hand, need specialized hardware that is expensive and power-intensive. Furthermore, placing these sensors in non-public areas raises significant privacy concerns. To address these limitations, recent research has explored the use of WiFi antennas (1D sensors) for body segmentation and key-point body detection. This paper further expands on the use of the WiFi signal in combination with deep learning architectures, commonly used in computer vision, to estimate dense human pose correspondence. We developed a deep neural network that maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions. The results of the study reveal that our model can estimate the dense pose of multiple subjects, with comparable performance to image-based approaches, by utilizing WiFi signals as the only input. This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.
Transformer-based algorithms are currently making a splash in the domain of image deblurring. Their success depends on the self-attention mechanism with a CNN stem to model long-range dependencies between tokens. Unfortunately, this appealing pipeline introduces high computational complexity and makes it difficult to process an ultra-high-definition image on a single GPU in real time. To trade off accuracy and efficiency, the input degraded image is computed cyclically over the three dimensions ($C$, $W$, and $H$) of the signal without a self-attention mechanism. We term this deep network the Multi-scale Cubic-Mixer; it acts on both the real and imaginary components after a fast Fourier transform to estimate the Fourier coefficients and thus obtain a deblurred image. Furthermore, we combine the Multi-scale Cubic-Mixer with a slicing strategy to generate high-quality results at a much lower computational cost. Experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art deblurring approaches on several benchmarks and a new ultra-high-definition dataset in terms of accuracy and speed.
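The data flow of operating on real and imaginary Fourier components separately can be sketched with NumPy. This is only an illustration of the transform, process, invert pattern: the `mix_real` and `mix_imag` callables are hypothetical placeholders for the learned network, which the abstract does not specify in detail, and the identity functions used here simply show that the round trip is lossless.

```python
import numpy as np

def apply_in_fourier_domain(image, mix_real, mix_imag):
    """FFT an image, process real and imaginary parts separately, then invert."""
    spectrum = np.fft.fft2(image)
    real = mix_real(spectrum.real)   # placeholder for a learned mapping
    imag = mix_imag(spectrum.imag)   # placeholder for a learned mapping
    restored = np.fft.ifft2(real + 1j * imag)
    # The input is real-valued, so any imaginary residue is numerical noise.
    return restored.real

def identity(x):
    """Stand-in that leaves the coefficients unchanged."""
    return x

rng = np.random.default_rng(0)
image = rng.random((8, 8))
roundtrip = apply_in_fourier_domain(image, identity, identity)
```

Replacing the two placeholders with trained modules would turn this into an estimator of the clean image's Fourier coefficients, which is the role the abstract assigns to the mixer.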
The goal of our research is to create a comprehensive and flexible library that is easy to use for medical imaging research and capable of handling grayscale images, multiple inputs (both images and tabular data), and multi-label tasks. We have named it Nervus. Based on the PyTorch library, which is well suited to AI research, we created a four-part model to handle comprehensive inputs and outputs. Nervus consists of four parts: the dataloader, the feature extractor, the feature mixer, and the classifier. The dataloader preprocesses the input data, the feature extractor extracts the features between the training data and ground truth labels, the feature mixer mixes the features from the extractors, and the classifier classifies the input data from the feature mixer based on the task. We have created Nervus, a comprehensive and flexible model library that is easy to use for medical imaging research and can handle grayscale images, multiple inputs, and multi-label tasks. This will be helpful for researchers in the field of radiology.
Image registration is an essential but challenging task in medical image computing, especially for echocardiography, where the anatomical structures are relatively noisy compared to other imaging modalities. Traditional (non-learning) registration approaches rely on the iterative optimization of a similarity metric, which is usually costly in time. In recent years, convolutional neural network (CNN) based image registration methods have shown good effectiveness. Meanwhile, recent studies show that attention-based models (e.g., the Transformer) can bring superior performance in pattern recognition tasks. However, whether the superior performance of the Transformer comes from its long-winded architecture or from the use of patches to divide the inputs remains unclear. This work introduces three patch-based frameworks for image registration using MLPs and Transformers. We provide experiments on 2D echocardiography registration to partially answer this question and provide a benchmark solution. Our results on a large public 2D echocardiography dataset show that the patch-based MLP/Transformer model can be effectively used for unsupervised echocardiography registration. These models demonstrate comparable and even better registration performance than a popular CNN registration model. In particular, patch-based models better preserve volume changes in terms of Jacobian determinants, thus generating robust registration fields with less unrealistic deformation. Our results demonstrate that patch-based learning methods, whether with attention or not, can achieve high-performance unsupervised registration with adequate time and space complexity. Our code is available at https://gitlab.inria.fr/epione/mlp_transformer_registration
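The Jacobian determinant mentioned in the registration abstract above can be made concrete with a small sketch. It assumes the usual convention that a registration field is a displacement $(u_x, u_y)$ added to the identity, and it uses central finite differences at interior pixels; the abstract does not say how the authors compute or aggregate the determinant, so `jacobian_determinants` is a hypothetical helper. A value of 1 means local area is preserved, values above or below 1 mean expansion or compression, and values at or below 0 indicate folding, which is the kind of unrealistic deformation the abstract refers to.

```python
def jacobian_determinants(ux, uy):
    """Jacobian determinant of phi(x, y) = (x + ux, y + uy) at interior pixels.

    ux and uy are 2D lists indexed [row][col] holding the displacement in the
    x (column) and y (row) directions, in pixel units.
    """
    rows, cols = len(ux), len(ux[0])
    dets = []
    for r in range(1, rows - 1):
        det_row = []
        for c in range(1, cols - 1):
            # Central differences approximate the partial derivatives.
            dux_dx = (ux[r][c + 1] - ux[r][c - 1]) / 2
            dux_dy = (ux[r + 1][c] - ux[r - 1][c]) / 2
            duy_dx = (uy[r][c + 1] - uy[r][c - 1]) / 2
            duy_dy = (uy[r + 1][c] - uy[r - 1][c]) / 2
            det_row.append((1 + dux_dx) * (1 + duy_dy) - dux_dy * duy_dx)
        dets.append(det_row)
    return dets

# Uniform 10% stretch along x: ux = 0.1 * x and uy = 0.
size = 5
ux = [[0.1 * c for c in range(size)] for _ in range(size)]
uy = [[0.0] * size for _ in range(size)]
dets = jacobian_determinants(ux, uy)
```

For this toy field every interior determinant is approximately 1.1, matching the 10% area increase; a zero displacement field gives exactly 1 everywhere.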
Text-guided image generation models can be prompted to generate images using nonce words adversarially designed to robustly evoke specific visual concepts. Two approaches for such generation are introduced: macaronic prompting, which involves designing cryptic hybrid words by concatenating subword units from different languages; and evocative prompting, which involves designing nonce words whose broad morphological features are similar enough to those of existing words to trigger robust visual associations. The two methods can also be combined to generate images associated with more specific visual concepts. The implications of these techniques for the circumvention of existing approaches to content moderation, and particularly the generation of offensive or harmful images, are discussed.