Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zeynep Akata

Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Oct 19, 2022

Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, Zeynep Akata, Anjan Dutta

Figure 1 for Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Figure 2 for Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Figure 3 for Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Figure 4 for Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Abstract:Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.

* British Machine Vision Conference (BMVC) 2022

Via

Access Paper or Ask Questions

Disentanglement of Correlated Factors via Hausdorff Factorized Support

Oct 13, 2022

Karsten Roth, Mark Ibrahim, Zeynep Akata, Pascal Vincent, Diane Bouchacourt

Figure 1 for Disentanglement of Correlated Factors via Hausdorff Factorized Support

Figure 2 for Disentanglement of Correlated Factors via Hausdorff Factorized Support

Figure 3 for Disentanglement of Correlated Factors via Hausdorff Factorized Support

Figure 4 for Disentanglement of Correlated Factors via Hausdorff Factorized Support

Abstract:A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a models representations with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we propose a relaxed disentanglement criterion - the Hausdorff Factorized Support (HFS) criterion - that encourages a factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, with in parts over +60% in relative improvement over existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization.

Via

Access Paper or Ask Questions

Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Oct 05, 2022

Abhra Chaudhuri, Massimiliano Mancini, Zeynep Akata, Anjan Dutta

Figure 1 for Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Figure 2 for Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Figure 3 for Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Figure 4 for Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Abstract:Fine-grained categories that largely share the same set of parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label. Starting with a rigorous formalization of the notion of distinguishability between fine-grained categories, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries in the fine-grained setting. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets and achieve state-of-the-art results on all of them, surpassing the performance of all existing works with a margin exceeding 4% in some cases. We also experimentally validate our theory on fine-grained distinguishability and obtain consistent results across multiple benchmarks. Implementation is available at https://github.com/abhrac/relational-proxies.

* Neural Information Processing Systems (NeurIPS) 2022

Via

Access Paper or Ask Questions

Semantic Image Synthesis with Semantically Coupled VQ-Model

Sep 06, 2022

Stephan Alaniz, Thomas Hummel, Zeynep Akata

Figure 1 for Semantic Image Synthesis with Semantically Coupled VQ-Model

Figure 2 for Semantic Image Synthesis with Semantically Coupled VQ-Model

Figure 3 for Semantic Image Synthesis with Semantically Coupled VQ-Model

Figure 4 for Semantic Image Synthesis with Semantically Coupled VQ-Model

Abstract:Semantic image synthesis enables control over unconditional image generation by allowing guidance on what is being generated. We conditionally synthesize the latent space from a vector quantized model (VQ-model) pre-trained to autoencode images. Instead of training an autoregressive Transformer on separately learned conditioning latents and image latents, we find that jointly learning the conditioning and image latents significantly improves the modeling capabilities of the Transformer model. While our jointly trained VQ-model achieves a similar reconstruction performance to a vanilla VQ-model for both semantic and image latents, tying the two modalities at the autoencoding stage proves to be an important ingredient to improve autoregressive modeling performance. We show that our model improves semantic image synthesis using autoregressive models on popular semantic image datasets ADE20k, Cityscapes and COCO-Stuff.

* ICLR 2022 DGM4HSD

Via

Access Paper or Ask Questions

Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

Aug 24, 2022

Yanbei Chen, Massimiliano Mancini, Xiatian Zhu, Zeynep Akata

Figure 1 for Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

Figure 2 for Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

Figure 3 for Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

Figure 4 for Semi-Supervised and Unsupervised Deep Visual Learning: A Survey

Abstract:State-of-the-art deep learning models are often trained with a large amount of costly labeled training data. However, requiring exhaustive manual annotations may degrade the model's generalizability in the limited-label regime. Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data. Recent progress in these paradigms has indicated the strong benefits of leveraging unlabeled data to improve model generalization and provide better model initialization. In this survey, we review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective. To offer a holistic understanding of the state-of-the-art in these areas, we propose a unified taxonomy. We categorize existing representative SSL and UL with comprehensive and insightful analysis to highlight their design rationales in different learning scenarios and applications in different computer vision tasks. Lastly, we discuss the emerging trends and open challenges in SSL and UL to shed light on future critical research directions.

* IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Via

Access Paper or Ask Questions

Abstracting Sketches through Simple Primitives

Jul 27, 2022

Stephan Alaniz, Massimiliano Mancini, Anjan Dutta, Diego Marcos, Zeynep Akata

Figure 1 for Abstracting Sketches through Simple Primitives

Figure 2 for Abstracting Sketches through Simple Primitives

Figure 3 for Abstracting Sketches through Simple Primitives

Figure 4 for Abstracting Sketches through Simple Primitives

Abstract:Humans show high-level of abstraction capabilities in games that require quickly communicating object information. They decompose the message content into multiple parts and communicate them in an interpretable protocol. Toward equipping machines with such capabilities, we propose the Primitive-based Sketch Abstraction task where the goal is to represent sketches using a fixed set of drawing primitives under the influence of a budget. To solve this task, our Primitive-Matching Network (PMN), learns interpretable abstractions of a sketch in a self supervised manner. Specifically, PMN maps each stroke of a sketch to its most similar primitive in a given set, predicting an affine transformation that aligns the selected primitive to the target stroke. We learn this stroke-to-primitive mapping end-to-end with a distance-transform loss that is minimal when the original sketch is precisely reconstructed with the predicted primitives. Our PMN abstraction empirically achieves the highest performance on sketch recognition and sketch-based image retrieval given a communication budget, while at the same time being highly interpretable. This opens up new possibilities for sketch analysis, such as comparing sketches by extracting the most relevant primitives that define an object category. Code is available at https://github.com/ExplainableML/sketch-primitives.

* European Conference on Computer Vision (ECCV) 2022

Via

Access Paper or Ask Questions

Temporal and cross-modal attention for audio-visual zero-shot learning

Jul 20, 2022

Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Figure 1 for Temporal and cross-modal attention for audio-visual zero-shot learning

Figure 2 for Temporal and cross-modal attention for audio-visual zero-shot learning

Figure 3 for Temporal and cross-modal attention for audio-visual zero-shot learning

Figure 4 for Temporal and cross-modal attention for audio-visual zero-shot learning

Abstract:Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video data can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (\modelName) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features that are obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time instead of self-attention within the modalities boosts the performance significantly. We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the \ucf, \vgg, and \activity benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at \url{https://github.com/ExplainableML/TCAF-GZSL}.

* ECCV 2022

Via

Access Paper or Ask Questions

BayesCap: Bayesian Identity Cap for Calibrated Uncertainty in Frozen Neural Networks

Jul 14, 2022

Uddeshya Upadhyay, Shyamgopal Karthik, Yanbei Chen, Massimiliano Mancini, Zeynep Akata

Figure 1 for BayesCap: Bayesian Identity Cap for Calibrated Uncertainty in Frozen Neural Networks

Abstract:High-quality calibrated uncertainty estimates are crucial for numerous real-world applications, especially for deep learning-based deployed ML systems. While Bayesian deep learning techniques allow uncertainty estimation, training them with large-scale datasets is an expensive process that does not always yield models competitive with non-Bayesian counterparts. Moreover, many of the high-performing deep learning models that are already trained and deployed are non-Bayesian in nature and do not provide uncertainty estimates. To address these issues, we propose BayesCap that learns a Bayesian identity mapping for the frozen model, allowing uncertainty estimation. BayesCap is a memory-efficient method that can be trained on a small fraction of the original dataset, enhancing pretrained non-Bayesian computer vision models by providing calibrated uncertainty estimates for the predictions without (i) hampering the performance of the model and (ii) the need for expensive retraining the model from scratch. The proposed method is agnostic to various architectures and tasks. We show the efficacy of our method on a wide variety of tasks with a diverse set of architectures, including image super-resolution, deblurring, inpainting, and crucial application such as medical image translation. Moreover, we apply the derived uncertainty estimates to detect out-of-distribution samples in critical scenarios like depth estimation in autonomous driving. Code is available at https://github.com/ExplainableML/BayesCap.

* Accepted at ECCV 2022. Code is available at https://github.com/ExplainableML/BayesCap

Via

Access Paper or Ask Questions

A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Jul 08, 2022

Michael Kirchhof, Karsten Roth, Zeynep Akata, Enkelejda Kasneci

Figure 1 for A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Figure 2 for A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Figure 3 for A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Figure 4 for A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning

Abstract:Proxy-based Deep Metric Learning (DML) learns deep representations by embedding images close to their class representatives (proxies), commonly with respect to the angle between them. However, this disregards the embedding norm, which can carry additional beneficial context such as class- or image-intrinsic uncertainty. In addition, proxy-based DML struggles to learn class-internal structures. To address both issues at once, we introduce non-isotropic probabilistic proxy-based DML. We model images as directional von Mises-Fisher (vMF) distributions on the hypersphere that can reflect image-intrinsic uncertainties. Further, we derive non-isotropic von Mises-Fisher (nivMF) distributions for class proxies to better represent complex class-specific variances. To measure the proxy-to-image distance between these models, we develop and investigate multiple distribution-to-point and distribution-to-distribution metrics. Each framework choice is motivated by a set of ablational studies, which showcase beneficial properties of our probabilistic approach to proxy-based DML, such as uncertainty-awareness, better-behaved gradients during training, and overall improved generalization performance. The latter is especially reflected in the competitive performance on the standard DML benchmarks, where our approach compares favorably, suggesting that existing proxy-based DML can significantly benefit from a more probabilistic treatment. Code is available at github.com/ExplainableML/Probabilistic_Deep_Metric_Learning.

* Accepted as conference paper at ECCV 2022

Via

Access Paper or Ask Questions

The Manifold Hypothesis for Gradient-Based Explanations

Jun 15, 2022

Sebastian Bordt, Uddeshya Upadhyay, Zeynep Akata, Ulrike von Luxburg

Figure 1 for The Manifold Hypothesis for Gradient-Based Explanations

Figure 2 for The Manifold Hypothesis for Gradient-Based Explanations

Figure 3 for The Manifold Hypothesis for Gradient-Based Explanations

Figure 4 for The Manifold Hypothesis for Gradient-Based Explanations

Abstract:When do gradient-based explanation algorithms provide meaningful explanations? We propose a necessary criterion: their feature attributions need to be aligned with the tangent space of the data manifold. To provide evidence for this hypothesis, we introduce a framework based on variational autoencoders that allows to estimate and generate image manifolds. Through experiments across a range of different datasets -- MNIST, EMNIST, CIFAR10, X-ray pneumonia and Diabetic Retinopathy detection -- we demonstrate that the more a feature attribution is aligned with the tangent space of the data, the more structured and explanatory it tends to be. In particular, the attributions provided by popular post-hoc methods such as Integrated Gradients, SmoothGrad and Input $\times$ Gradient tend to be more strongly aligned with the data manifold than the raw gradient. As a consequence, we suggest that explanation algorithms should actively strive to align their explanations with the data manifold. In part, this can be achieved by adversarial training, which leads to better alignment across all datasets. Some form of adjustment to the model architecture or training algorithm is necessary, since we show that generalization of neural networks alone does not imply the alignment of model gradients with the data manifold.

Via

Access Paper or Ask Questions