Dilip Krishnan

Google Research

Muse: Text-To-Image Generation via Masked Generative Transformers

Jan 02, 2023

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein (+2 more)

Abstract: We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
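
The parallel decoding mentioned in the abstract can be illustrated with a toy sketch. The abstract does not spell out the sampling procedure, so everything here is an assumption: the confidence-ranked commit rule and the cosine schedule follow the general style of masked generative decoders, `toy_predict` is a hypothetical stand-in for the Transformer (it returns random tokens and confidences), and text conditioning is omitted entirely.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, vocab_size, steps, seed=0):
    """Fill all token positions in at most `steps` passes instead of one pass per token."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict every position in parallel.
        preds = toy_predict(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        still_masked = int(num_tokens * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - still_masked)
        # Commit only the most confident predictions; the rest stay masked.
        by_confidence = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(num_tokens=16, vocab_size=1024, steps=8))
```

With 16 tokens and 8 steps, the loop runs 8 model passes, whereas a token-by-token autoregressive decoder would need 16; this gap is the efficiency argument the abstract makes.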

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Nov 16, 2022

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

Abstract: Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
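
The variable masking ratio idea can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract says the ratio varies but not how it is drawn, so the uniform distribution, the 0.5 to 1.0 range, the `MASK_ID` sentinel, and the function name `mask_tokens` are all hypothetical choices, not the authors' settings.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def mask_tokens(token_ids, min_ratio=0.5, max_ratio=1.0, rng=None):
    """Mask a randomly chosen fraction of the tokens.

    A fresh ratio is drawn on every call, so one training run sees both
    high ratios (generation-like) and lower ratios (representation-like).
    """
    rng = rng or random.Random()
    ratio = rng.uniform(min_ratio, max_ratio)  # assumed distribution
    n_mask = max(1, round(ratio * len(token_ids)))
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(token_ids)]
    return masked, sorted(positions), ratio


if __name__ == "__main__":
    tokens = list(range(100, 116))  # 16 made-up token ids
    masked, positions, ratio = mask_tokens(tokens, rng=random.Random(0))
    print(f"ratio={ratio:.2f}, masked {len(positions)} of {len(tokens)} tokens")
```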

A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

Oct 30, 2022

Shlok Mishra, Joshua Robinson, Huiwen Chang, David Jacobs, Aaron Sarna, Aaron Maschinot, Dilip Krishnan

Abstract: We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, but is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L models.

* Mishra and Robinson contributed equally

Object-Aware Cropping for Self-Supervised Learning

Dec 01, 2021

Shlok Mishra, Anshul Shah, Ankan Bansal, Abhyuday Jagannatha, Abhishek Sharma, David Jacobs, Dilip Krishnan

Abstract: A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which the learned representation will capture. This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image. However, in other datasets such as OpenImages or COCO, which are more representative of real-world uncurated data, there are typically multiple small objects in an image. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets. We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the model to learn both object and scene level semantic representations. Using this approach, which we call object-aware cropping, results in significant improvements over scene cropping on classification and object detection benchmarks. For example, on OpenImages, our approach achieves an improvement of 8.8% mAP over random scene-level cropping using MoCo-v2 based pre-training. We also show significant improvements on COCO and PASCAL-VOC object detection and segmentation tasks over the state-of-the-art self-supervised learning approaches. Our approach is efficient, simple and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.
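
The view-selection step described above can be sketched as follows, for the variant in which one of the two random crops is replaced. The assumptions are labeled in the code: proposal boxes are taken as already computed by some object proposal algorithm (the abstract does not say which), the `min_scale` value is a placeholder, and the function names are hypothetical.

```python
import random


def random_scene_crop(width, height, rng, min_scale=0.2):
    """Random box (x0, y0, x1, y1) covering min_scale..1.0 of the image area."""
    scale = rng.uniform(min_scale, 1.0)  # assumed area range
    crop_w = max(1, int(width * scale ** 0.5))
    crop_h = max(1, int(height * scale ** 0.5))
    x0 = rng.randint(0, width - crop_w)
    y0 = rng.randint(0, height - crop_h)
    return (x0, y0, x0 + crop_w, y0 + crop_h)


def object_aware_views(width, height, proposals, rng):
    """Return two crop boxes: a scene-level crop and an object-level crop.

    `proposals` is a list of (x0, y0, x1, y1) boxes from an external
    object proposal method. If it is empty, fall back to two scene crops.
    """
    scene_view = random_scene_crop(width, height, rng)
    if proposals:
        object_view = rng.choice(proposals)
    else:
        object_view = random_scene_crop(width, height, rng)
    return scene_view, object_view


if __name__ == "__main__":
    rng = random.Random(0)
    boxes = [(10, 20, 90, 120), (200, 40, 260, 110)]  # made-up proposals
    print(object_aware_views(320, 240, boxes, rng))
```

Both boxes would then be cropped, resized, and fed to the contrastive or non-contrastive loss as a positive pair, in place of two scene-level crops.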

Pyramid Adversarial Training Improves ViT Performance

Nov 30, 2021

Charles Herrmann, Kyle Sargent, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, Deqing Sun

Abstract: Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformer (ViT). One such data augmentation technique is adversarial training; however, many prior works have shown that this often results in poor clean accuracy. In this work, we present Pyramid Adversarial Training, a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid Adversarial Training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to 1.82% absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on 7 ImageNet robustness metrics, by absolute numbers ranging from 1.76% to 11.45%. We set a new state-of-the-art for ImageNet-C (41.4 mCE), ImageNet-R (53.92%), and ImageNet-Sketch (41.04%) without extra data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training. Our code will be publicly available upon acceptance.

* 32 pages, including references & supplementary material

Contrastive Multiview Coding for Enzyme-Substrate Interaction Prediction

Nov 18, 2021

Apurva Kalia, Dilip Krishnan, Soha Hassoun

Abstract: Characterizing enzyme function is an important requirement for predicting enzyme-substrate interactions. In this paper, we present a novel approach that applies Contrastive Multiview Coding to this problem to improve prediction performance. We present a method that leverages auxiliary data from an enzymatic database such as KEGG to learn the mutual information present in multiple views of enzyme-substrate reactions. We show that congruency in the multiple views of the reaction data can be used to improve prediction performance.

* 3 pages, submitted to Machine Learning in Computational Biology 2021

Unsupervised Disentanglement without Autoencoding: Pitfalls and Future Directions

Aug 14, 2021

Andrea Burns, Aaron Sarna, Dilip Krishnan, Aaron Maschinot

Abstract: Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs). While prior work has focused on generative methods for disentangled representation learning, these approaches do not scale to large datasets due to current limitations of generative models. Instead, we explore regularization methods with contrastive learning, which could result in disentangled representations that are powerful enough for large-scale datasets and downstream applications. However, we find that unsupervised disentanglement is difficult to achieve due to optimization and initialization sensitivity, with trade-offs in task performance. We evaluate disentanglement with downstream tasks, analyze the benefits and disadvantages of each regularization used, and discuss future directions.

* Accepted at the ICML 2021 Self-Supervised Learning for Reasoning and Perception Workshop

Understanding invariance via feedforward inversion of discriminatively trained classifiers

Mar 15, 2021

Piotr Teterwak, Chiyuan Zhang, Dilip Krishnan, Michael C. Mozer

Abstract: A discriminatively trained neural net classifier achieves optimal performance if all information about its input other than class membership has been discarded prior to the output layer. Surprisingly, past research has discovered that some extraneous visual detail remains in the output logits. This finding is based on inversion techniques that map deep embeddings back to images. Although the logit inversions seldom produce coherent, natural images or recognizable object classes, they do recover some visual detail. We explore this phenomenon further using a novel synthesis of methods, yielding a feedforward inversion model that produces remarkably high fidelity reconstructions, qualitatively superior to those of past efforts. When applied to an adversarially robust classifier model, the reconstructions contain sufficient local detail and global structure that they might be confused with the original image at a quick glance, and the object category can clearly be gleaned from the reconstruction. Our approach is based on BigGAN (Brock, 2019), with conditioning on logits instead of one-hot class labels. We use our reconstruction model as a tool for exploring the nature of representations, including: the influence of model architecture and training objectives (specifically robust losses), the forms of invariance that networks achieve, representational differences between correctly and incorrectly classified images, and the effects of manipulating logits and images. We believe that our method can inspire future investigations into the nature of information flow in a neural net and can provide diagnostics for improving discriminative models.

What makes for good views for contrastive learning

May 20, 2020

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Abstract: Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification (73% top-1 linear readoff with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training. Code: http://github.com/HobbitLong/PyContrast

* submitted to ECCV 2020

Supervised Contrastive Learning

Apr 23, 2020

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan

Abstract: Cross entropy is the most widely used loss function for supervised training of image classification models. In this paper, we propose a novel training methodology that consistently outperforms cross entropy on supervised learning tasks across different architectures and data augmentations. We modify the batch contrastive loss, which has recently been shown to be very effective at learning powerful representations in the self-supervised setting. We are thus able to leverage label information more effectively than cross entropy. Clusters of points belonging to the same class are pulled together in embedding space, while clusters of samples from different classes are simultaneously pushed apart. In addition to this, we leverage key ingredients such as large batch sizes and normalized embeddings, which have been shown to benefit self-supervised learning. On both ResNet-50 and ResNet-200, we outperform cross entropy by over 1%, setting a new state of the art number of 78.8% among methods that use AutoAugment data augmentation. The loss also shows clear benefits for robustness to natural corruptions on standard benchmarks on both calibration and accuracy. Compared to cross entropy, our supervised contrastive loss is more stable to hyperparameter settings such as optimizers or data augmentations.
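
A label-aware batch contrastive loss of the kind described above can be sketched in plain Python. The abstract does not give the formula, so this is an illustration under assumptions: for each anchor, every other sample with the same label counts as a positive, all other samples form the softmax denominator, embeddings are L2-normalized, and the result is averaged over anchors. The temperature of 0.1 and the name `supcon_loss` are hypothetical choices.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean over anchors of the negative log-probability of their same-class samples."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Scaled cosine similarity to every other sample in the batch.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        # Numerically stable log-sum-exp over the denominator.
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supcon_loss(points, [0, 0, 1, 1]))  # labels agree with the geometry
    print(supcon_loss(points, [0, 1, 0, 1]))  # labels disagree with the geometry
```

The loss is small when same-class points already sit close together and large when the labels cut across the clusters, which is the pull-together, push-apart behavior the abstract describes.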
