Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"Image": models, code, and papers

Language Guided Adversarial Purification

Sep 19, 2023
Himanshu Singh, A V Subramanyam

Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.

Via

Access Paper or Ask Questions

Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms

Sep 22, 2023
Olivier Gamache, Jean-Michel Fortin, Matěj Boxan, François Pomerleau, Philippe Giguère

Figure 1 for Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms

Figure 2 for Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms

Figure 3 for Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms

Figure 4 for Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms

Visual Odometry (VO) is one of the fundamental tasks in computer vision for robotics. However, its performance is deeply affected by High Dynamic Range (HDR) scenes, omnipresent outdoor. While new Automatic-Exposure (AE) approaches to mitigate this have appeared, their comparison in a reproducible manner is problematic. This stems from the fact that the behavior of AE depends on the environment, and it affects the image acquisition process. Consequently, AE has traditionally only been benchmarked in an online manner, making the experiments non-reproducible. To solve this, we propose a new methodology based on an emulator that can generate images at any exposure time. It leverages BorealHDR, a unique multi-exposure stereo dataset collected over 8.4 km, on 50 trajectories with challenging illumination conditions. Moreover, it contains pose ground truth for each image and a global 3D map, based on lidar data. We show that using these images acquired at different exposure times, we can emulate realistic images keeping a Root-Mean-Square Error (RMSE) below 1.78 % compared to ground truth images. To demonstrate the practicality of our approach for offline benchmarking, we compared three state-of-the-art AE algorithms on key elements of Visual Simultaneous Localization And Mapping (VSLAM) pipeline, against four baselines. Consequently, reproducible evaluation of AE is now possible, speeding up the development of future approaches. Our code and dataset are available online at this link: https://github.com/norlab-ulaval/BorealHDR

* 6 pages, 6 figures, submitted to 2024 IEEE International Conference on Robotics and Automation (ICRA 2024)

Via

Access Paper or Ask Questions

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Oct 01, 2023
Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in many vision-language tasks. Nevertheless, most MLLMs still lack the Referential Comprehension (RC) ability to identify a specific object or area in images, limiting their application in fine-grained perception tasks. This paper proposes a novel method to enhance the RC capability for MLLMs. Our model represents the referring object in the image using the coordinates of its bounding box and converts the coordinates into texts in a specific format. This allows the model to treat the coordinates as natural language. Moreover, we construct the instruction tuning dataset with various designed RC tasks at a low cost by unleashing the potential of annotations in existing datasets. To further boost the RC ability of the model, we propose a self-consistent bootstrapping method that extends dense object annotations of a dataset into high-quality referring-expression-bounding-box pairs. The model is trained end-to-end with a parameter-efficient tuning framework that allows both modalities to benefit from multi-modal instruction tuning. This framework requires fewer trainable parameters and less training data. Experimental results on conventional vision-language and RC tasks demonstrate the superior performance of our method. For instance, our model exhibits a 12.0% absolute accuracy improvement over Instruct-BLIP on VSR and surpasses Kosmos-2 by 24.7% on RefCOCO_val under zero-shot settings. We also attain the top position on the leaderboard of MMBench. The models, datasets, and codes are publicly available at https://github.com/SY-Xuan/Pink

Via

Access Paper or Ask Questions

Fewshot learning on global multimodal embeddings for earth observation tasks

Sep 29, 2023
Matt Allen, Francisco Dorr, Joseph A. Gallego-Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán

In this work we pretrain a CLIP/ViT based model using three different modalities of satellite imagery across five AOIs covering over ~10\% of the earth total landmass, namely Sentinel 2 RGB optical imagery, Sentinel 1 SAR amplitude and Sentinel 1 SAR interferometric coherence. This model uses $\sim 250$ M parameters. Then, we use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation related to vegetation, built up surface, croplands and permanent water. We consistently show how we reduce the need for labeled data by 99\%, so that with ~200-500 randomly selected labeled examples (around 4K-10K km$^2$) we reach performance levels analogous to those achieved with the full labeled datasets (about 150K image chips or 3M km$^2$ in each AOI) on all modalities, AOIs and downstream tasks. This leads us to think that the model has captured significant earth features useful in a wide variety of scenarios. To enhance our model's usability in practice, its architecture allows inference in contexts with missing modalities and even missing channels within each modality. Additionally, we visually show that this embedding space, obtained with no labels, is sensible to the different earth features represented by the labelled datasets we selected.

* 7 pages, 5 figures, submitted to NeurIPS workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models

Via

Access Paper or Ask Questions

CNN-based automatic segmentation of Lumen & Media boundaries in IVUS images using closed polygonal chains

Sep 29, 2023
Pavel Sinha, Ioannis Psaromiligkos, Zeljko Zilic

We propose an automatic segmentation method for lumen and media with irregular contours in IntraVascular ultra-sound (IVUS) images. In contrast to most approaches that broadly label each pixel as either lumen, media, or background, we propose to approximate the lumen and media contours by closed polygonal chains. The chain vertices are placed at fixed angles obtained by dividing the entire 360\degree~angular space into equally spaced angles, and we predict their radius using an adaptive-subband-decomposition CNN. We consider two loss functions during training. The first is a novel loss function using the Jaccard Measure (JM) to quantify the similarities between the predicted lumen and media segments and the corresponding ground-truth image segments. The second loss function is the traditional Mean Squared Error. The proposed architecture significantly reduces computational costs by replacing the popular auto-encoder structure with a simple CNN as the encoder and the decoder is reduced to simply joining the consecutive predicted points. We evaluated our network on the publicly available IVUS-Challenge-2011 dataset using two performance metrics, namely JM and Hausdorff Distance (HD). The evaluation results show that our proposed network mostly outperforms the state-of-the-art lumen and media segmentation methods.

Via

Access Paper or Ask Questions

IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

Sep 29, 2023
Jiancheng Huang, Donghao Zhou, Shifeng Chen

Figure 1 for IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

Figure 2 for IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

Figure 3 for IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

Figure 4 for IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

Single-shot face anti-spoofing (FAS) is a key technique for securing face recognition systems, and it requires only static images as input. However, single-shot FAS remains a challenging and under-explored problem due to two main reasons: 1) on the data side, learning FAS from RGB images is largely context-dependent, and single-shot images without additional annotations contain limited semantic information. 2) on the model side, existing single-shot FAS models are infeasible to provide proper evidence for their decisions, and FAS methods based on depth estimation require expensive per-pixel annotations. To address these issues, a large binocular NIR image dataset (BNI-FAS) is constructed and published, which contains more than 300,000 real face and plane attack images, and an Interpretable FAS Transformer (IFAST) is proposed that requires only weak supervision to produce interpretable predictions. Our IFAST can produce pixel-wise disparity maps by the proposed disparity estimation Transformer with Dynamic Matching Attention (DMA) block. Besides, a well-designed confidence map generator is adopted to cooperate with the proposed dual-teacher distillation module to obtain the final discriminant results. The comprehensive experiments show that our IFAST can achieve state-of-the-art results on BNI-FAS, proving the effectiveness of the single-shot FAS based on binocular NIR images.

Via

Access Paper or Ask Questions

Vision Transformers for Computer Go

Sep 22, 2023
Amani Sagri, Tristan Cazenave, Jérôme Arjonilla, Abdallah Saffidine

Figure 1 for Vision Transformers for Computer Go

Figure 2 for Vision Transformers for Computer Go

Figure 3 for Vision Transformers for Computer Go

Figure 4 for Vision Transformers for Computer Go

Motivated by the success of transformers in various fields, such as language understanding and image analysis, this investigation explores their application in the context of the game of Go. In particular, our study focuses on the analysis of the Transformer in Vision. Through a detailed analysis of numerous points such as prediction accuracy, win rates, memory, speed, size, or even learning rate, we have been able to highlight the substantial role that transformers can play in the game of Go. This study was carried out by comparing them to the usual Residual Networks.

Via

Access Paper or Ask Questions

Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Sep 18, 2023
Weijian Huang, Cheng Li, Hao Yang, Jiarun Liu, Shanshan Wang

Figure 1 for Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Figure 2 for Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Figure 3 for Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Figure 4 for Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning

Recently, multi-modal vision-language foundation models have gained significant attention in the medical field. While these models offer great opportunities, they still face a number of challenges, such as the requirement for fine-grained knowledge understanding in computer-aided diagnosis and capability of utilizing very limited or no task-specific labeled data in real-world clinical applications. In this study, we present MaCo, a novel multi-modal medical foundation model that explores masked contrastive learning to achieve granular alignment and zero-shot learning for a variety of medical imaging tasks. MaCo incorporates a correlation weighting mechanism to adjust the correlation between masked image patches and their corresponding reports, thereby enhancing the representation learning capabilities. We evaluate MaCo on six well-known open-source X-ray datasets, and the experimental results show it outperforms seven state-of-the-art approaches for classification, segmentation, and zero-shot phase grounding, demonstrating its great potential to promote a wide range of medical image analysis tasks.

Via

Access Paper or Ask Questions

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Sep 18, 2023
Yating liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin Liao

Figure 1 for CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Figure 2 for CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Figure 3 for CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Figure 4 for CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Text-based Person Retrieval aims to retrieve the target person images given a textual query. The primary challenge lies in bridging the substantial gap between vision and language modalities, especially when dealing with limited large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer(CSKT) approach for TBPR. Specifically, to explore the CLIP's knowledge on input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on output side of Multi-Head Self-Attention (MHSA) in vision and language. This synergistic two-way collaborative mechanism promotes the early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness and generalization.

Via

Access Paper or Ask Questions

Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

Sep 18, 2023
Rong Liu, Enyu Zhao, Zhiyuan Liu, Andrew Wei-Wen Feng, Scott John Easley

Figure 1 for Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

Figure 2 for Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

Figure 3 for Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

Figure 4 for Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

In this paper, we propose an Instant Photorealistic Style Transfer (IPST) approach, designed to achieve instant photorealistic style transfer on super-resolution inputs without the need for pre-training on pair-wise datasets or imposing extra constraints. Our method utilizes a lightweight StyleNet to enable style transfer from a style image to a content image while preserving non-color information. To further enhance the style transfer process, we introduce an instance-adaptive optimization to prioritize the photorealism of outputs and accelerate the convergence of the style network, leading to a rapid training completion within seconds. Moreover, IPST is well-suited for multi-frame style transfer tasks, as it retains temporal and multi-view consistency of the multi-frame inputs such as video and Neural Radiance Field (NeRF). Experimental results demonstrate that IPST requires less GPU memory usage, offers faster multi-frame transfer speed, and generates photorealistic outputs, making it a promising solution for various photorealistic transfer applications.

* 8 pages (reference excluded), 6 figures, 4 tables

Via

Access Paper or Ask Questions