Phillip Isola

MIT

Improving CLIP Training with Language Rewrites

May 31, 2023

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian

Abstract:Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training. Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
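
The text-augmentation step described in the abstract (picking either the original caption or one of its rewrites for each image) can be illustrated with a minimal sketch. The function name `sample_caption` and the uniform sampling over all candidates are assumptions for illustration; the abstract only states that the selection is random, and the official code lives in the repository linked above.

```python
import random


def sample_caption(original, rewrites, rng):
    """Return one text for an image: the original caption or one rewrite.

    Assumption: every candidate is equally likely. The abstract only says
    the choice is random, not how it is weighted.
    """
    candidates = [original] + list(rewrites)
    return rng.choice(candidates)


# Usage: draw a text augmentation for one image at each training step.
rng = random.Random(0)
caption = sample_caption(
    "a dog running on the beach",
    ["a puppy sprints across the sand", "on the shore, a dog is running"],
    rng,
)
```

Because the rewrites are generated ahead of time, this per-step choice is a list lookup, which is consistent with the abstract's claim of no extra computation or memory cost during training.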

Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks

May 15, 2023

Minyoung Huh, Brian Cheung, Pulkit Agrawal, Phillip Isola

Abstract:This work examines the challenges of training neural networks with vector quantization using straight-through estimation. We find that a primary cause of training instability is the discrepancy between the model embedding and the code-vector distribution. We identify the factors that contribute to this issue, including the codebook gradient sparsity and the asymmetric nature of the commitment loss, which leads to misaligned code-vector assignments. We propose to address this issue via affine re-parameterization of the code vectors. Additionally, we introduce an alternating optimization to reduce the gradient error introduced by the straight-through estimation. Moreover, we propose an improvement to the commitment loss to ensure better alignment between the codebook representation and the model embedding. These optimization methods improve the mathematical approximation of the straight-through estimation and, ultimately, the model performance. We demonstrate the effectiveness of our methods on several common model architectures, such as AlexNet, ResNet, and ViT, across various tasks, including image classification and generative modeling.
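
For readers unfamiliar with the baseline being analyzed, here is a minimal sketch of plain vector quantization with straight-through estimation, written without an autodiff library. It shows only the standard technique, not the paper's proposed affine re-parameterization, alternating optimization, or revised commitment loss. All function names are hypothetical.

```python
def nearest_code(z, codebook):
    """Return (index, code vector) of the codebook entry closest to z."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]


def ste_forward(z, codebook):
    """Forward pass: the output takes the value of the selected code vector.

    Written as z + (q - z) to mirror the usual autodiff formulation, where
    the (q - z) term is treated as a constant during backpropagation.
    """
    idx, q = nearest_code(z, codebook)
    return idx, [zi + (qi - zi) for zi, qi in zip(z, q)]


def ste_backward(grad_output):
    """Backward pass: the gradient w.r.t. z is copied from the output
    gradient unchanged, as if quantization were the identity function."""
    return list(grad_output)


# Usage: quantize one 2-D embedding against a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, quantized = ste_forward([0.9, 0.2], codebook)
grad_z = ste_backward([0.5, -1.0])
```

Note that only the selected code vector is involved in each forward pass, which illustrates the codebook gradient sparsity the abstract mentions.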

Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

Apr 06, 2023

Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang

Abstract:In goal-reaching reinforcement learning (RL), the optimal value function has a particular geometry, called quasimetric structure. This paper introduces Quasimetric Reinforcement Learning (QRL), a new RL method that utilizes quasimetric models to learn optimal value functions. Distinct from prior approaches, the QRL objective is specifically designed for quasimetrics, and provides strong theoretical recovery guarantees. Empirically, we conduct thorough analyses on a discretized MountainCar environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.

* Project Page: https://www.tongzhouwang.info/quasimetric_rl/
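
To make the quasimetric structure concrete, the sketch below computes optimal costs-to-go on a small directed graph and checks that they form a quasimetric: zero self-distance, non-negativity, and the triangle inequality, without requiring symmetry. This is a toy illustration only, not the QRL objective; the graph and the function names are assumptions.

```python
import math


def shortest_path_costs(num_states, edges):
    """All-pairs optimal cost-to-go via Floyd-Warshall.

    edges: iterable of (source, target, cost) with non-negative costs.
    """
    d = [[0.0 if i == j else math.inf for j in range(num_states)]
         for i in range(num_states)]
    for u, v, cost in edges:
        d[u][v] = min(d[u][v], float(cost))
    for k in range(num_states):
        for i in range(num_states):
            for j in range(num_states):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def is_quasimetric(d):
    """Check d(x, x) = 0, d >= 0, and d(x, z) <= d(x, y) + d(y, z)."""
    n = range(len(d))
    identity = all(d[i][i] == 0 for i in n)
    non_negative = all(d[i][j] >= 0 for i in n for j in n)
    triangle = all(d[i][j] <= d[i][k] + d[k][j]
                   for i in n for j in n for k in n)
    return identity and non_negative and triangle


# Usage: a one-way cycle where going "forward" is cheap and "back" is costly.
costs = shortest_path_costs(3, [(0, 1, 1), (1, 2, 1), (2, 0, 5)])
asymmetric = costs[0][2] != costs[2][0]   # reaching 2 from 0 differs from 0 from 2
valid = is_quasimetric(costs)
```

In a goal-reaching setting where each step incurs a cost, the optimal value of reaching goal `g` from state `s` is the negated entry `costs[s][g]`, which is why the optimal value function inherits this geometry.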

Persistent Nature: A Generative Model of Unbounded 3D Worlds

Mar 23, 2023

Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, Noah Snavely

Abstract:Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency--for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/.

* CVPR camera ready version, project page: https://chail.github.io/persistent-nature/

Steerable Equivariant Representation Learning

Feb 22, 2023

Sangnie Bhardwaj, Willie McClinton, Tongzhou Wang, Guillaume Lajoie, Chen Sun, Phillip Isola, Dilip Krishnan

Abstract:Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote invariance in the embedding space to the input image transformations. This invariance reduces generalization to those downstream tasks which rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead equivariant to data augmentations. We achieve this equivariance through the use of steerable representations. Our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that our resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g. we improve linear probe top-1 accuracy by 1% to 3% for transfer; and ImageNet-C accuracy by up to 3.4%. We further show that the steerability of our representations provides significant speedup (nearly 50x) for test-time augmentations; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.
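
The idea of a steerable, equivariant representation can be illustrated with a classical analytic example rather than the learned maps the paper describes. Taking the discrete Fourier transform as the embedding, a cyclic shift of the input corresponds to a fixed linear (diagonal) map applied in embedding space, so the embedding of a transformed input can be obtained without re-encoding it. The choice of embedding, the transformation, and all function names are assumptions for illustration.

```python
import cmath


def dft(signal):
    """Embedding: the discrete Fourier transform of a 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def cyclic_shift(signal, shift):
    """Input-space transformation: rotate the signal by `shift` positions."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]


def steer(embedding, shift):
    """Embedding-space transformation: a diagonal linear map (phase ramp)
    that reproduces the effect of cyclic_shift on the embedding."""
    n = len(embedding)
    return [
        cmath.exp(-2j * cmath.pi * k * shift / n) * embedding[k]
        for k in range(n)
    ]


# Usage: encoding the shifted input matches steering the original embedding.
x = [1.0, 2.0, 0.5, -1.0, 3.0]
encoded_after_shift = dft(cyclic_shift(x, 2))
steered_embedding = steer(dft(x), 2)
max_gap = max(abs(a - b) for a, b in zip(encoded_after_shift, steered_embedding))
```

Steering costs one multiplication per embedding dimension here, whereas re-encoding requires a full forward pass; this is the kind of saving that makes test-time augmentation cheaper when applied directly in embedding space.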

MIRA: Mental Imagery for Robotic Affordances

Dec 12, 2022

Lin Yen-Chen, Pete Florence, Andy Zeng, Jonathan T. Barron, Yilun Du, Wei-Chiu Ma, Anthony Simeonov, Alberto Rodriguez Garcia, Phillip Isola

Abstract:Humans form mental images of 3D scenes to support counterfactual imagination, planning, and motor control. Our abilities to predict the appearance and affordance of the scene from previously unobserved viewpoints aid us in performing manipulation tasks (e.g., 6-DoF kitting) with a level of ease that is currently out of reach for existing robot learning frameworks. In this work, we aim to build artificial systems that can analogously plan actions on top of imagined images. To this end, we introduce Mental Imagery for Robotic Affordances (MIRA), an action reasoning framework that optimizes actions with novel-view synthesis and affordance prediction in the loop. Given a set of 2D RGB images, MIRA builds a consistent 3D scene representation, through which we synthesize novel orthographic views amenable to pixel-wise affordances prediction for action optimization. We illustrate how this optimization process enables us to generalize to unseen out-of-plane rotations for 6-DoF robotic manipulation tasks given a limited number of demonstrations, paving the way toward machines that autonomously learn to understand the world around them for planning actions.

* CoRL 2022, webpage: https://yenchenlin.me/mira

Procedural Image Programs for Representation Learning

Nov 29, 2022

Manel Baradad, Chun-Fu Chen, Jonas Wulff, Tongzhou Wang, Rogerio Feris, Antonio Torralba, Phillip Isola

Abstract:Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.

* 29 pages; accepted at the Conference on Neural Information Processing Systems 2022 (NeurIPS 2022)

Improved Representation of Asymmetrical Distances with Interval Quasimetric Embeddings

Nov 28, 2022

Tongzhou Wang, Phillip Isola

Abstract:Asymmetrical distance structures (quasimetrics) are ubiquitous in our lives and are gaining more attention in machine learning applications. Imposing such quasimetric structures in model representations has been shown to improve many tasks, including reinforcement learning (RL) and causal relation learning. In this work, we present four desirable properties in such quasimetric models, and show how prior works fail at them. We propose Interval Quasimetric Embedding (IQE), which is designed to satisfy all four criteria. On three quasimetric learning experiments, IQEs show strong approximation and generalization abilities, leading to better performance and improved efficiency over prior methods. Project Page: https://www.tongzhouwang.info/interval_quasimetric_embedding Quasimetric Learning Code Package: https://www.github.com/quasimetric-learning/torch-quasimetric

* NeurIPS 2022 NeurReps Workshop Proceedings Track

Powderworld: A Platform for Understanding Generalization via Rich Task Distributions

Nov 23, 2022

Kevin Frans, Phillip Isola

Abstract:One of the grand challenges of reinforcement learning is the ability to generalize to new tasks. However, general agents require a set of rich, diverse tasks to train on. Designing a `foundation environment' for such tasks is tricky -- the ideal environment would support a range of emergent phenomena, an expressive task space, and fast runtime. To take a step towards addressing this research bottleneck, this work presents Powderworld, a lightweight yet expressive simulation environment running directly on the GPU. Within Powderworld, two motivating challenge distributions are presented, one for world-modelling and one for reinforcement learning. Each contains hand-designed test tasks to examine generalization. Experiments indicate that increasing the environment's complexity improves generalization for world models and certain reinforcement learning agents, yet may inhibit learning in high-variance environments. Powderworld aims to support the study of generalization by providing a source of diverse tasks arising from the same core rules.

Totems: Physical Objects for Verifying Visual Integrity

Sep 26, 2022

Jingwei Ma, Lucy Chai, Minyoung Huh, Tongzhou Wang, Ser-Nam Lim, Phillip Isola, Antonio Torralba

Abstract:We introduce a new approach to image forensics: placing physical refractive objects, which we call totems, into a scene so as to protect any photograph taken of that scene. Totems bend and redirect light rays, thus providing multiple, albeit distorted, views of the scene within a single image. A defender can use these distorted totem pixels to detect if an image has been manipulated. Our approach unscrambles the light rays passing through the totems by estimating their positions in the scene and using their known geometric and material properties. To verify a totem-protected image, we detect inconsistencies between the scene reconstructed from totem viewpoints and the scene's appearance from the camera viewpoint. Such an approach makes the adversarial manipulation task more difficult, as the adversary must modify both the totem and image pixels in a geometrically consistent manner without knowing the physical properties of the totem. Unlike prior learning-based approaches, our method does not require training on datasets of specific manipulations, and instead uses physical properties of the scene and camera to solve the forensics problem.

* ECCV 2022 camera ready version; project page https://jingweim.github.io/totems/
