Dina Katabi

Scaling Laws of Synthetic Images for Model Training for Now

Dec 07, 2023

Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian

Abstract:Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
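
The classifier-free guidance scale named in the abstract can be illustrated with the widely used formulation in which a guided prediction extrapolates from the unconditional output toward the text-conditioned one. This is a minimal sketch of that common convention, not the paper's code; the function name `apply_guidance` and the list-of-floats representation are assumptions made for illustration.

```python
def apply_guidance(uncond, cond, scale):
    """Combine unconditional and conditional predictions.

    Common convention (assumed here): guided = uncond + scale * (cond - uncond).
    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further toward the text condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy two-element "predictions"; real models operate on image-sized tensors.
    print(apply_guidance([0.0, 1.0], [1.0, 3.0], scale=2.0))
```

Under this convention, larger scales typically trade sample diversity for closer adherence to the prompt, which is one reason the scale can matter when synthetic images are used as training data.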

Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Oct 05, 2023

Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

Abstract:Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT comprises a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits scaling behavior similar to that of high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data.

Unsupervised Object Localization with Representer Point Selection

Sep 08, 2023

Yeonghwan Song, Seokwoo Jang, Dina Katabi, Jeany Son

Abstract:We propose a novel unsupervised object localization method that allows us to explain the predictions of the model by utilizing self-supervised pre-trained models without additional finetuning. Existing unsupervised and self-supervised object localization methods often utilize class-agnostic activation maps or self-similarity maps of a pre-trained model. Although these maps can offer valuable information for localization, their limited ability to explain how the model makes predictions remains challenging. In this paper, we propose a simple yet effective unsupervised object localization method based on representer point selection, where the predictions of the model can be represented as a linear combination of representer values of training points. By selecting representer points, which are the most important examples for the model predictions, our model can provide insights into how the model predicts the foreground object by providing relevant examples as well as their importance. Our method outperforms the state-of-the-art unsupervised and self-supervised object localization methods on various datasets with significant margins and even outperforms recent weakly supervised and few-shot methods.

* Accepted by ICCV 2023

Improving CLIP Training with Language Rewrites

May 31, 2023

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian

Abstract:Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training. Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
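
The text-augmentation step described in the abstract, picking either the original caption or one of its rewrites for each image, might look like the sketch below. It assumes uniform sampling over all candidates, which the abstract does not specify; `sample_caption` and the example captions are hypothetical, and the released LaCLIP code may differ.

```python
import random


def sample_caption(original, rewrites, rng):
    """Pick one caption for an image: the original or any rewritten version.

    Uniform sampling over all candidates is an assumption for illustration.
    """
    candidates = [original] + list(rewrites)
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    original = "a dog running on the beach"
    rewrites = [
        "a dog sprints along the shoreline",
        "on the beach, a dog is running",
    ]
    # A fresh draw per training step exposes the same image to varied text.
    for _ in range(3):
        print(sample_caption(original, rewrites, rng))
```

Because the rewrites are produced ahead of time, the only cost at training time is the random choice, which is consistent with the abstract's claim of no computation or memory overhead during training.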

Reparo: Loss-Resilient Generative Codec for Video Conferencing

May 23, 2023

Tianhong Li, Vibhaalakshmi Sivaraman, Lijie Fan, Mohammad Alizadeh, Dina Katabi

Abstract:Loss of packets in video conferencing often results in poor quality and video freezing. Attempting to retransmit the lost packets is usually not practical due to the requirement for real-time playback. Using Forward Error Correction (FEC) to recover the lost packets is challenging since it is difficult to determine the appropriate level of redundancy. In this paper, we propose a framework called Reparo for creating loss-resilient video conferencing using generative deep learning models. Our approach involves generating missing information when a frame or part of a frame is lost. This generation is conditioned on the data received so far, and the model's knowledge of how people look, dress, and interact in the visual world. Our experiments on publicly available video conferencing datasets show that Reparo outperforms state-of-the-art FEC-based video conferencing in terms of both video quality (measured by PSNR) and video freezes.
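
For readers unfamiliar with the video-quality metric named in the abstract, PSNR is the standard quantity 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the reconstructed frame. The sketch below computes it for flat lists of pixel values; the function name and the 8-bit default peak of 255 are assumptions, and this is not the paper's evaluation code.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    frame = [0, 64, 128, 255]
    noisy = [1, 63, 129, 254]  # every pixel off by one
    print(round(psnr(frame, noisy), 2))
```

Higher values indicate a reconstruction closer to the reference frame.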

Change is Hard: A Closer Look at Subpopulation Shift

Feb 23, 2023

Yuzhe Yang, Haoran Zhang, Dina Katabi, Marzyeh Ghassemi

Abstract:Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood on the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
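
The two selection criteria contrasted in the abstract can be sketched as follows: worst-group accuracy needs a group annotation per example, while worst-class accuracy needs only the labels. The helper names are hypothetical and this is not the SubpopBench implementation.

```python
from collections import defaultdict


def worst_bucket_accuracy(y_true, y_pred, buckets):
    """Minimum per-bucket accuracy, where each example belongs to one bucket."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, b in zip(y_true, y_pred, buckets):
        total[b] += 1
        correct[b] += int(t == p)
    return min(correct[b] / total[b] for b in total)


def worst_class_accuracy(y_true, y_pred):
    """Worst-class accuracy: buckets are the true labels, no group info needed."""
    return worst_bucket_accuracy(y_true, y_pred, y_true)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_pred = [0, 0, 1, 0]
    # Hypothetical (label, attribute) groups, used only for worst-group accuracy.
    groups = [(0, "x"), (0, "x"), (1, "x"), (1, "y")]
    print(worst_class_accuracy(y_true, y_pred))           # uses labels only
    print(worst_bucket_accuracy(y_true, y_pred, groups))  # needs group labels
```

Selecting a checkpoint by worst-class accuracy then amounts to taking the maximum of `worst_class_accuracy` over candidate checkpoints on validation data.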

Contactless Oxygen Monitoring with Gated Transformer

Dec 06, 2022

Hao He, Yuan Yuan, Ying-Cong Chen, Peng Cao, Dina Katabi

Abstract:With the increasing popularity of telehealth, it becomes critical to ensure that basic physiological signals can be monitored accurately at home, with minimal patient overhead. In this paper, we propose a contactless approach for monitoring patients' blood oxygen at home, simply by analyzing the radio signals in the room, without any wearable devices. We extract the patients' respiration from the radio signals that bounce off their bodies and devise a novel neural network that infers a patient's oxygen estimates from their breathing signal. Our model, called Gated BERT-UNet, is designed to adapt to the patient's medical indices (e.g., gender, sleep stages). It has multiple predictive heads and selects the most suitable head via a gate controlled by the person's physiological indices. Extensive empirical results show that our model achieves high accuracy on both medical and radio datasets.
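
The gating idea in the abstract, several predictive heads with one chosen according to the person's indices, can be shown in toy form. Everything below is a hypothetical illustration: the head functions, the `sleep_stage` key, and the numbers are placeholders with no physiological meaning, and the actual model is a neural network rather than a lookup.

```python
def gated_predict(features, indices, heads, gate):
    """Route the features to the single head chosen by the gate."""
    head_name = gate(indices)
    return heads[head_name](features)


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy heads: each maps the same features to an output in a different way.
    heads = {
        "awake": lambda f: mean(f),
        "rem": lambda f: mean(f) + 1.0,
    }
    # Toy gate: pick the head keyed by a (hypothetical) sleep-stage index.
    gate = lambda indices: indices["sleep_stage"]
    print(gated_predict([1.0, 3.0], {"sleep_stage": "rem"}, heads, gate))
```

The design point is that all heads share the same input representation, while the gate depends only on the person's indices and not on the breathing signal itself.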

* 19 pages, Workshop on Learning from Time Series for Health, NeurIPS 2022

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Nov 16, 2022

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

Abstract:Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
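
The variable masking ratio at the heart of the abstract can be sketched as drawing a fresh ratio for each training example and masking that fraction of its tokens. The uniform distribution and the 0.5 to 1.0 range below are illustrative assumptions, since the abstract does not state the sampling distribution; `sample_mask` is a hypothetical helper, not code from the MAGE repository.

```python
import random


def sample_mask(num_tokens, rng, low=0.5, high=1.0):
    """Draw a masking ratio, then mask that fraction of token positions.

    Returns (ratio, mask), where mask[i] is True if token i is masked.
    The uniform range [low, high] is an assumed placeholder distribution.
    """
    ratio = rng.uniform(low, high)
    num_masked = max(1, min(num_tokens, round(ratio * num_tokens)))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return ratio, [i in masked for i in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        ratio, mask = sample_mask(16, rng)
        print(f"ratio={ratio:.2f} masked={sum(mask)}/16")
```

Draws near the high end resemble the generative regime described in the abstract, while lower draws resemble the representation-learning regime, all within one training loop.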

SimPer: Simple Self-Supervised Learning of Periodic Targets

Oct 06, 2022

Yuzhe Yang, Xin Liu, Jiang Wu, Silviu Borac, Dina Katabi, Ming-Zher Poh, Daniel McDuff

Abstract:From human physiology to environmental evolution, important processes in nature often exhibit meaningful and strong periodic or quasi-periodic changes. Due to their inherent label scarcity, learning useful representations for periodic tasks with limited or no supervision is of great benefit. Yet, existing self-supervised learning (SSL) methods overlook the intrinsic periodicity in data, and fail to learn representations that capture periodic or frequency attributes. In this paper, we present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations. Extensive experiments on common real-world tasks in human behavior analysis, environmental sensing, and healthcare domains verify the superior performance of SimPer compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Code and data are available at: https://github.com/YyzHarry/SimPer.

Supervised Contrastive Regression

Oct 03, 2022

Kaiwen Zha, Peng Cao, Yuzhe Yang, Dina Katabi

Abstract:Deep regression models typically learn in an end-to-end fashion and do not explicitly try to learn a regression-aware representation. Their representations tend to be fragmented and fail to capture the continuous nature of regression tasks. In this paper, we propose Supervised Contrastive Regression (SupCR), a framework that learns a regression-aware representation by contrasting samples against each other based on their target distance. SupCR is orthogonal to existing regression models, and can be used in combination with such models to improve performance. Extensive experiments using five real-world regression datasets that span computer vision, human-computer interaction, and healthcare show that using SupCR achieves the state-of-the-art performance and consistently improves prior regression baselines on all datasets, tasks, and input modalities. SupCR also improves robustness to data corruptions, resilience to reduced training data, performance on transfer learning, and generalization to unseen targets.
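
One plausible reading of "contrasting samples against each other based on their target distance" is sketched below: for an anchor i and another sample j, the samples whose targets are at least as far from the anchor's target as j's serve as the comparison set. This formulation, the cosine similarity, and the temperature of 0.5 are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def target_distance_contrastive_loss(embeddings, targets, temperature=0.5):
    """Average over ordered pairs (i, j) of -log(exp(s_ij) / sum_k exp(s_ik)),
    where k ranges over samples (k != i) whose target is at least as far
    from target i as target j is. Assumed formulation, for illustration."""
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        # Exponentiated, temperature-scaled similarity of anchor i to others.
        sims = {
            k: math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(n) if k != i
        }
        for j in sims:
            d_ij = abs(targets[i] - targets[j])
            denom = sum(s for k, s in sims.items()
                        if abs(targets[i] - targets[k]) >= d_ij)
            total += -math.log(sims[j] / denom)
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0]
    ordered = [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)]    # layout follows targets
    scrambled = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]  # layout ignores targets
    print(target_distance_contrastive_loss(ordered, targets))
    print(target_distance_contrastive_loss(scrambled, targets))
```

In this toy example the embedding layout that follows the target ordering receives the lower loss, which is the behavior a regression-aware representation objective is meant to reward.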

* The first two authors contributed equally to this paper
