Zeynep Akata

Integrating Language Guidance into Vision-based Deep Metric Learning

Mar 16, 2022

Karsten Roth, Oriol Vinyals, Zeynep Akata

Abstract:Deep Metric Learning (DML) proposes to learn metric spaces which encode semantic similarities as embedding space distances. These spaces should be transferable to classes beyond those seen during training. Commonly, DML methods task networks to solve contrastive ranking tasks defined over binary class assignments. However, such approaches ignore higher-level semantic relations between the actual classes. This causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relation between classes, impacting the generalizability of the learned metric space. To tackle this issue, we propose a language guidance objective for visual similarity learning. Leveraging language embeddings of expert- and pseudo-classnames, we contextualize and realign visual representation spaces corresponding to meaningful language semantics for better semantic consistency. Extensive experiments and ablations provide a strong motivation for our proposed approach and show language guidance offering significant, model-agnostic improvements for DML, achieving competitive and state-of-the-art results on all benchmarks. Code available at https://github.com/ExplainableML/LanguageGuidance_for_DML.

* Accepted to CVPR 2022
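
As a rough illustration of the "contrastive ranking tasks defined over binary class assignments" that the abstract describes as the common DML setup, the sketch below computes a triplet margin loss in which only same-class versus different-class membership matters. It is not taken from the paper or its repository: the function names, the margin value, and the exhaustive triplet enumeration are all assumptions made for illustration, and the proposed language guidance objective is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss asking the positive to be closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def batch_triplet_loss(embeddings, labels, margin=0.2):
    """Mean triplet loss over all valid (anchor, positive, negative) triplets.

    Labels are used only as a binary signal (same class or not), which is
    the limitation the abstract points out: relations between different
    classes are never consulted.
    """
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be a different sample of the same class
            for q in range(n):
                if labels[q] == labels[a]:
                    continue  # negative must come from another class
                losses.append(
                    triplet_loss(embeddings[a], embeddings[p], embeddings[q], margin)
                )
    return sum(losses) / len(losses) if losses else 0.0
```

Under this loss, two different classes are pushed apart equally regardless of how semantically related they are, which is the gap the abstract says language guidance is meant to address.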

BDA-SketRet: Bi-Level Domain Adaptation for Zero-Shot SBIR

Jan 17, 2022

Ushasi Chaudhuri, Ruchika Chavan, Biplab Banerjee, Anjan Dutta, Zeynep Akata

Abstract:The efficacy of zero-shot sketch-based image retrieval (ZS-SBIR) models is governed by two challenges. The immense distribution gap between the sketches and the images requires a proper domain alignment. Moreover, the fine-grained nature of the task and the high intra-class variance of many categories necessitate a class-wise discriminative mapping among the sketch, image, and semantic spaces. Under this premise, we propose BDA-SketRet, a novel ZS-SBIR framework performing a bi-level domain adaptation for aligning the spatial and semantic features of the visual data pairs progressively. In order to highlight the shared features and reduce the effects of any sketch- or image-specific artifacts, we propose a novel symmetric loss function based on the notion of information bottleneck for aligning the semantic features, while a cross-entropy-based adversarial loss is introduced to align the spatial feature maps. Finally, our CNN-based model confirms the discriminativeness of the shared latent space through a novel topology-preserving semantic projection network. Experimental results on the extended Sketchy, TU-Berlin, and QuickDraw datasets exhibit sharp improvements over the literature.

Audio Retrieval with Natural Language Queries: A Benchmark Study

Dec 17, 2021

A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie

Abstract:The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset will be made publicly available.

* Submitted to Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2105.02192
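
The retrieval task in the abstract (pick, from a pool of candidates, the audio clip that best matches a text query) is typically scored by ranking candidates by similarity. The sketch below is a generic illustration under assumed conventions: it takes a precomputed query-by-candidate similarity matrix in which the correct audio for query `i` sits at index `i`, and reports recall at `k`. The metric choice, the function name, and the data layout are assumptions, not details stated in the abstract.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between text query i and audio candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 audio clips (made-up numbers).
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
top1 = recall_at_k(scores, 1)
top2 = recall_at_k(scores, 2)
```

Transposing the matrix gives the audio-to-text direction with the same function.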

Human Attention in Fine-grained Classification

Nov 02, 2021

Yao Rong, Wenjia Xu, Zeynep Akata, Enkelejda Kasneci

Abstract:The way humans attend to, process and classify a given image has the potential to vastly benefit the performance of deep learning models. Exploiting where humans are focusing can rectify models when they are deviating from essential features for correct decisions. To validate that human attention contains valuable information for decision-making processes such as fine-grained classification, we compare human attention and model explanations in discovering important features. Towards this goal, we collect human gaze data for the fine-grained classification dataset CUB and build a dataset named CUB-GHA (Gaze-based Human Attention). Furthermore, we propose the Gaze Augmentation Training (GAT) and Knowledge Fusion Network (KFN) to integrate human gaze knowledge into classification models. We implement our proposals in CUB-GHA and the recently released medical dataset CXR-Eye of chest X-ray images, which includes gaze data collected from a radiologist. Our result reveals that integrating human attention knowledge benefits classification effectively, e.g. improving the baseline by 4.38% on CXR. Hence, our work provides not only valuable insights into understanding human attention in fine-grained classification, but also contributes to future research in integrating human gaze with computer vision tasks. CUB-GHA and code are available at https://github.com/yaorong0921/CUB-GHA.

* British Machine Vision Conference (BMVC) 2021
* 19 pages, 9 figures

Robustness via Uncertainty-aware Cycle Consistency

Oct 24, 2021

Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata

Abstract:Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual with a generalized Gaussian distribution, capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks, including unpaired image translation of natural images using standard datasets spanning autonomous driving, maps, and facades, and also in the medical imaging domain with MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here: https://github.com/ExplainableML/UncertaintyAwareCycleConsistency.

* Accepted at NeurIPS 2021. Code is at https://github.com/ExplainableML/UncertaintyAwareCycleConsistency. arXiv admin note: substantial text overlap with arXiv:2102.11747
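
To make the phrase "models the per-pixel residual by generalized Gaussian distribution" concrete, the sketch below evaluates the negative log-likelihood of a single residual under the standard zero-mean generalized Gaussian density, p(r) = β / (2αΓ(1/β)) · exp(−(|r|/α)^β). This is the textbook density only; the parameter names `alpha` (scale) and `beta` (shape) and the helper functions are assumptions, and the exact loss used in UGAC may differ.

```python
import math


def ggd_nll(residual, alpha, beta):
    """Negative log-likelihood of a zero-mean generalized Gaussian.

    alpha > 0 is the scale and beta > 0 the shape. beta = 2 recovers a
    Gaussian, beta = 1 a Laplace; smaller beta gives heavier tails.
    """
    return (
        (abs(residual) / alpha) ** beta
        - math.log(beta)
        + math.log(2.0 * alpha)
        + math.lgamma(1.0 / beta)
    )


def gaussian_nll(residual, sigma):
    """Reference Gaussian negative log-likelihood, for comparison only."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + residual ** 2 / (2.0 * sigma ** 2)


# With beta = 2 and alpha = sqrt(2) * sigma the two expressions coincide.
same_as_gaussian = math.isclose(
    ggd_nll(0.7, math.sqrt(2.0) * 1.5, 2.0), gaussian_nll(0.7, 1.5)
)
```

For a large residual, a shape below 2 yields a much smaller penalty than the Gaussian case at the same scale, which is the sense in which a heavy-tailed model tolerates outliers.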

Conditional De-Identification of 3D Magnetic Resonance Images

Oct 18, 2021

Lennart Alexander Van der Goten, Tobias Hepp, Zeynep Akata, Kevin Smith

Abstract:Privacy protection of medical image data is challenging. Even if metadata is removed, brain scans are vulnerable to attacks that match renderings of the face to facial image databases. Solutions have been developed to de-identify diagnostic scans by obfuscating or removing parts of the face. However, these solutions either fail to reliably hide the patient's identity or are so aggressive that they impair further analyses. We propose a new class of de-identification techniques that, instead of removing facial features, remodels them. Our solution relies on a conditional multi-scale GAN architecture. It takes a patient's MRI scan as input and generates a 3D volume conditioned on the patient's brain, which is preserved exactly, but where the face has been de-identified through remodeling. We demonstrate that our approach preserves privacy far better than existing techniques, without compromising downstream medical analyses. Analyses were run on the OASIS-3 and ADNI corpora.

Fine-Grained Zero-Shot Learning with DNA as Side Information

Sep 29, 2021

Sarkhan Badirli, Zeynep Akata, George Mohler, Christine Picard, Murat Dundar

Abstract:The fine-grained zero-shot learning task requires some form of side information to transfer discriminative information from seen to unseen classes. As manually annotated visual attributes are extremely costly and often impractical to obtain for a large number of classes, in this study we use DNA as side information for the first time for fine-grained zero-shot classification of species. Mitochondrial DNA plays an important role as a genetic marker in evolutionary biology and has been used to achieve near-perfect accuracy in the species classification of living organisms. We implement a simple hierarchical Bayesian model that uses DNA information to establish the hierarchy in the image space and employs local priors to define surrogate classes for unseen ones. On the benchmark CUB dataset, we show that DNA can be an equally promising, yet in general more accessible, alternative to word vectors as side information. This is especially important as obtaining robust word representations for fine-grained species names is not a practicable goal when information about these species in free-form text is limited. On a newly compiled fine-grained insect dataset that uses DNA information from over a thousand species, we show that the Bayesian approach outperforms the state of the art by a wide margin.

* Accepted to NeurIPS 2021

Concurrent Discrimination and Alignment for Self-Supervised Feature Learning

Aug 19, 2021

Anjan Dutta, Massimiliano Mancini, Zeynep Akata

Abstract:Existing self-supervised learning methods learn representations by means of pretext tasks which are either (1) discriminating, explicitly specifying which features should be separated, or (2) aligning, precisely indicating which features should be pulled close together, but they ignore the question of how to jointly and in a principled way define which features should be repelled and which should be attracted. In this work, we combine the positive aspects of the discriminating and aligning methods, and design a hybrid method that addresses the above issue. Our method explicitly specifies the repulsion and attraction mechanisms, respectively, by a discriminative predictive task and by concurrently maximizing mutual information between paired views sharing redundant information. We qualitatively and quantitatively show that our proposed model learns better features that are more effective for diverse downstream tasks ranging from classification to semantic segmentation. Our experiments on nine established benchmarks show that the proposed model consistently outperforms the existing state-of-the-art results under self-supervised and transfer learning protocols.

* International Conference on Computer Vision (DeepMTL) 2021

Uncertainty-Guided Progressive GANs for Medical Image Translation

Jul 02, 2021

Uddeshya Upadhyay, Yanbei Chen, Tobias Hepp, Sergios Gatidis, Zeynep Akata

Abstract:Image-to-image translation plays a vital role in tackling various medical imaging tasks such as attenuation correction, motion correction, undersampled reconstruction, and denoising. Generative adversarial networks have been shown to achieve the state-of-the-art in generating high fidelity images for these tasks. However, the state-of-the-art GAN-based frameworks do not estimate the uncertainty in the predictions made by the network that is essential for making informed medical decisions and subsequent revision by medical experts and has recently been shown to improve the performance and interpretability of the model. In this work, we propose an uncertainty-guided progressive learning scheme for image-to-image translation. By incorporating aleatoric uncertainty as attention maps for GANs trained in a progressive manner, we generate images of increasing fidelity progressively. We demonstrate the efficacy of our model on three challenging medical image translation tasks, including PET to CT translation, undersampled MRI reconstruction, and MRI motion artefact correction. Our model generalizes well in three different tasks and improves performance over state of the art under full-supervision and weak-supervision with limited data. Code is released here: https://github.com/ExplainableML/UncerGuidedI2I

* Accepted at MICCAI 2021. Code is released here: https://github.com/ExplainableML/UncerGuidedI2I

Keep CALM and Improve Visual Feature Attribution

Jun 15, 2021

Jae Myung Kim, Junsuk Choe, Zeynep Akata, Seong Joon Oh

Abstract:The class activation mapping, or CAM, has been the cornerstone of feature attribution methods for multiple vision tasks. Its simplicity and effectiveness have led to wide applications in the explanation of visual predictions and weakly-supervised localization tasks. However, CAM has its own shortcomings. The computation of attribution maps relies on ad-hoc calibration steps that are not part of the training computational graph, making it difficult for us to understand the real meaning of the attribution values. In this paper, we improve CAM by explicitly incorporating a latent variable encoding the location of the cue for recognition in the formulation, thereby subsuming the attribution map into the training computational graph. The resulting model, class activation latent mapping, or CALM, is trained with the expectation-maximization algorithm. Our experiments show that CALM identifies discriminative attributes for image classifiers more accurately than CAM and other visual attribution baselines. CALM also shows performance improvements over prior art on the weakly-supervised object localization benchmarks. Our code is available at https://github.com/naver-ai/calm.

* 20 pages, 11 figures
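
For readers unfamiliar with the baseline the abstract criticizes, the sketch below shows a plain CAM computation: a class-specific weighted sum of the final convolutional feature channels, followed by min-max scaling. The scaling is used here as an assumed example of a post-hoc step applied outside the training graph; the function names and nested-list layout are illustrative choices, and CALM itself (the latent-variable model trained with expectation-maximization) is not reproduced.

```python
def class_activation_map(features, class_weights):
    """Weighted sum over channels: cam[y][x] = sum_k w_k * features[k][y][x].

    features is a list of K channels, each an H x W nested list;
    class_weights holds the K classifier weights for one target class.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(class_weights, features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    return cam


def min_max_normalize(cam):
    """Rescale a map to [0, 1]; a post-hoc step not seen during training."""
    flat = [value for row in cam for value in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in cam]  # constant map: nothing to rank
    return [[(value - low) / span for value in row] for row in cam]


# Two 2x2 feature channels and their weights for one class (toy values).
toy_features = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
toy_cam = class_activation_map(toy_features, [1.0, 0.5])
toy_scaled = min_max_normalize(toy_cam)
```

Because the rescaling depends on each map's own minimum and maximum, the resulting values are relative within one image, which is one way to read the abstract's concern about interpreting attribution values.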
