Tianhong Li

Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Oct 05, 2023
Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g., via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): a training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data. ITIT comprises a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or only text. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, the model generates a caption for a given input image, then uses that caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits scaling behavior similar to that of training on high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude less paired image-text data (only 3M pairs).
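
Below is a minimal sketch of the image-to-text-to-image cycle-consistency objective described in the abstract, for unpaired images only. The captioner and generator are toy stand-ins (simple linear layers), not the ITIT encoder/decoders, and the MSE similarity term is an illustrative choice.

```python
# A minimal sketch of the image -> text -> image cycle-consistency idea, assuming
# generic captioner/generator callables; these toy linear stand-ins are NOT the
# actual ITIT architecture, and the MSE similarity term is illustrative.
import torch
import torch.nn.functional as F

def image_cycle_loss(images, caption_image, generate_image):
    """Caption each unpaired image, regenerate an image from that caption,
    and penalize the difference between the original and regenerated image."""
    captions = caption_image(images)             # image -> text-like code
    reconstructions = generate_image(captions)   # text-like code -> image
    return F.mse_loss(reconstructions, images)

# Toy stand-ins so the sketch runs end to end.
captioner = torch.nn.Linear(3 * 64 * 64, 32)     # "image -> caption" surrogate
generator = torch.nn.Linear(32, 3 * 64 * 64)     # "caption -> image" surrogate

images = torch.randn(8, 3 * 64 * 64)             # batch of flattened unpaired images
loss = image_cycle_loss(images, captioner, generator)
loss.backward()                                  # gradients flow through both directions
print(loss.item())
```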

Reparo: Loss-Resilient Generative Codec for Video Conferencing

May 23, 2023
Tianhong Li, Vibhaalakshmi Sivaraman, Lijie Fan, Mohammad Alizadeh, Dina Katabi

Loss of packets in video conferencing often results in poor quality and video freezing. Attempting to retransmit the lost packets is usually not practical due to the requirement for real-time playback. Using Forward Error Correction (FEC) to recover the lost packets is challenging since it is difficult to determine the appropriate level of redundancy. In this paper, we propose a framework called Reparo for creating loss-resilient video conferencing using generative deep learning models. Our approach involves generating missing information when a frame or part of a frame is lost. This generation is conditioned on the data received so far, and the model's knowledge of how people look, dress, and interact in the visual world. Our experiments on publicly available video conferencing datasets show that Reparo outperforms state-of-the-art FEC-based video conferencing in terms of both video quality (measured by PSNR) and video freezes.
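
A minimal sketch of the loss-concealment idea described above, reconstructing the missing parts of a frame conditioned on the parts that did arrive; the toy inpainting network here is a hypothetical stand-in operating on raw pixels, not Reparo's actual generative codec.

```python
# A minimal sketch: fill in lost regions of a frame conditioned on the received
# pixels and a loss mask. The model is a toy stand-in, not Reparo's codec.
import torch
import torch.nn as nn

class ToyInpainter(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Input: received pixels plus a 1-channel mask of which pixels arrived.
        self.net = nn.Sequential(nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, received, mask):
        filled = self.net(torch.cat([received * mask, mask], dim=1))
        # Keep the pixels that arrived; only generate the missing ones.
        return received * mask + filled * (1 - mask)

frame = torch.randn(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.3).float()   # roughly 30% of the frame lost
recovered = ToyInpainter()(frame, mask)
print(recovered.shape)
```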

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Nov 16, 2022
Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
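
The variable-ratio masking mechanism described above can be sketched as follows; the masking-ratio range, token grid size, and codebook size are illustrative assumptions, not MAGE's exact settings.

```python
# A minimal sketch of variable-ratio token masking: the masking ratio is sampled
# per batch, so the same model sees generation-like (high ratio) and
# representation-like (lower ratio) inputs. Hyperparameters are illustrative.
import torch

def mask_tokens(tokens, mask_token_id, ratio_min=0.5, ratio_max=1.0):
    """Replace a randomly chosen fraction of token ids with a mask token."""
    batch, seq_len = tokens.shape
    ratio = torch.empty(1).uniform_(ratio_min, ratio_max).item()  # sampled ratio
    num_masked = int(seq_len * ratio)
    scores = torch.rand(batch, seq_len)
    masked_idx = scores.argsort(dim=1)[:, :num_masked]            # positions to mask
    masked = tokens.clone()
    masked.scatter_(1, masked_idx, mask_token_id)
    return masked, masked_idx

vocab_size, mask_token_id = 1024, 1024            # VQ codebook ids plus one mask id
tokens = torch.randint(0, vocab_size, (4, 256))   # e.g. a 16x16 grid of VQGAN tokens
masked, idx = mask_tokens(tokens, mask_token_id)
print(masked.shape, idx.shape)
```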

Unsupervised Learning for Human Sensing Using Radio Signals

Jul 06, 2022
Tianhong Li, Lijie Fan, Yuan Yuan, Dina Katabi

There is a growing literature demonstrating the feasibility of using Radio Frequency (RF) signals to enable key computer vision tasks in the presence of occlusions and poor lighting. These works leverage the fact that RF signals traverse walls and occlusions to deliver through-wall pose estimation, action recognition, scene captioning, and human re-identification. However, unlike RGB datasets, which can be labeled by human workers, labeling RF signals is a daunting task because such signals are not human-interpretable. Yet it is fairly easy to collect unlabeled RF signals, so it would be highly beneficial to use such unlabeled RF data to learn useful representations in an unsupervised manner. Thus, in this paper, we explore the feasibility of adapting RGB-based unsupervised representation learning to RF signals. We show that while contrastive learning has emerged as the main technique for unsupervised representation learning from images and videos, such methods produce poor performance when applied to sensing humans using RF signals. In contrast, predictive unsupervised learning methods learn high-quality representations that can be used for multiple downstream RF-based sensing tasks. Our empirical results show that this approach outperforms state-of-the-art RF-based human sensing on various tasks, opening the possibility of unsupervised representation learning from this novel modality.

* WACV 2022. The first three authors contributed equally to this paper 

Targeted Supervised Contrastive Learning for Long-Tailed Recognition

Nov 27, 2021
Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio Feris, Piotr Indyk, Dina Katabi

Real-world data often exhibits a long-tailed distribution with heavy class imbalance, where the majority classes can dominate the training process and alter the decision boundaries of the minority classes. Recently, researchers have investigated the potential of supervised contrastive learning for long-tailed recognition and demonstrated that it provides a strong performance gain. In this paper, we show that while supervised contrastive learning can help improve performance, past baselines suffer from poor uniformity brought on by the imbalanced data distribution. This poor uniformity manifests as poor separability of minority-class samples in the feature space. To address this problem, we propose targeted supervised contrastive learning (TSC), which improves the uniformity of the feature distribution on the hypersphere. TSC first generates a set of targets uniformly distributed on a hypersphere. It then makes the features of different classes converge to these distinct and uniformly distributed targets during training. This forces all classes, including minority classes, to maintain a uniform distribution in the feature space, improves class boundaries, and provides better generalization even in the presence of long-tailed data. Experiments on multiple datasets show that TSC achieves state-of-the-art performance on long-tailed recognition tasks.
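
The two ingredients described above, a set of class targets spread uniformly on the hypersphere and a loss pulling features toward them, can be sketched as follows; the repulsion-based target optimization is a simple illustrative heuristic, not the paper's exact procedure.

```python
# A minimal sketch, assuming a repulsion heuristic to spread class targets on the
# unit hypersphere and a cosine loss pulling each feature toward its class target.
import torch
import torch.nn.functional as F

def uniform_targets(num_classes, dim, steps=500, lr=0.1):
    """Spread num_classes unit vectors apart by penalizing each target's
    similarity to its nearest neighbor."""
    t = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([t], lr=lr)
    for _ in range(steps):
        u = F.normalize(t, dim=1)
        sim = u @ u.t()
        # Ignore self-similarity on the diagonal, then push nearest neighbors apart.
        loss = sim.masked_fill(torch.eye(num_classes, dtype=torch.bool), -1.0).max(dim=1).values.mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return F.normalize(t.detach(), dim=1)

def target_attraction_loss(features, labels, targets):
    """Pull each (normalized) feature toward the fixed target of its class."""
    f = F.normalize(features, dim=1)
    return (1 - (f * targets[labels]).sum(dim=1)).mean()

targets = uniform_targets(num_classes=10, dim=128)
features = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 10, (32,))
print(target_attraction_loss(features, labels, targets).item())
```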

* The first two authors contributed equally to this paper 

Information-Preserving Contrastive Learning for Self-Supervised Representations

Dec 17, 2020
Tianhong Li, Lijie Fan, Yuan Yuan, Hao He, Yonglong Tian, Dina Katabi

Contrastive learning is very effective at learning useful representations without supervision. Yet it has limitations: it can learn a shortcut that is irrelevant to the downstream task and discard relevant information. Past work has addressed this limitation via custom data augmentations that eliminate the shortcut. This solution, however, does not work for data modalities that are not interpretable by humans, e.g., radio signals. For such modalities, it is hard for a human to guess which shortcuts may exist in the signal, or how to alter the radio signals to eliminate them. Even for visual data, eliminating the shortcut may sometimes be undesirable: the shortcut may be irrelevant to one downstream task but important to another. In this case, it is desirable to learn a representation that captures both the shortcut information and the information relevant to the other downstream task. This paper presents information-preserving contrastive learning (IPCL), a new framework for unsupervised representation learning that preserves relevant information even in the presence of shortcuts. We empirically show that IPCL addresses the above problems, outperforming contrastive learning both on radio signals and on learning RGB representations whose different features support different downstream tasks.

* The first two authors contributed equally to this paper 

In-Home Daily-Life Captioning Using Radio Signals

Aug 25, 2020
Lijie Fan, Tianhong Li, Yuan Yuan, Dina Katabi

This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing privacy-preserving radio signals in the home, together with the home's floormap. RF-Diary can further observe and caption people's lives through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains good performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. For more information, please visit our project webpage: http://rf-diary.csail.mit.edu
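
A minimal sketch of what a multi-modal feature-alignment objective of this kind could look like is shown below; the encoders and the cosine-alignment loss are hypothetical stand-ins, not RF-Diary's actual training scheme.

```python
# A minimal sketch: push an RF encoder's features toward a video encoder's
# features for the same scene, so a shared captioning head can also be trained
# on existing video-captioning data. Both encoders are toy stand-ins.
import torch
import torch.nn.functional as F

def alignment_loss(rf_features, video_features):
    """Cosine-distance alignment between RF and video features of the same scene."""
    rf = F.normalize(rf_features, dim=1)
    vid = F.normalize(video_features, dim=1)
    return (1 - (rf * vid).sum(dim=1)).mean()

rf_encoder = torch.nn.Linear(256, 512)       # stand-in for the RF feature extractor
video_encoder = torch.nn.Linear(1024, 512)   # stand-in for a pretrained video encoder

rf_input = torch.randn(4, 256)
video_input = torch.randn(4, 1024)
# Detach the video branch so only the RF encoder is pulled toward it.
loss = alignment_loss(rf_encoder(rf_input), video_encoder(video_input).detach())
print(loss.item())
```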

* ECCV 2020. The first two authors contributed equally to this paper 

Learning Longterm Representations for Person Re-Identification Using Radio Signals

Apr 02, 2020
Lijie Fan, Tianhong Li, Rongyao Fang, Rumen Hristov, Yuan Yuan, Dina Katabi

Person Re-Identification (ReID) aims to recognize a person-of-interest across different places and times. Existing ReID methods rely on images or videos collected using RGB cameras. They extract appearance features like clothes, shoes, and hair. Such features, however, can change drastically from one day to the next, leading to an inability to identify people over extended time periods. In this paper, we introduce RF-ReID, a novel approach that harnesses radio frequency (RF) signals for long-term person ReID. RF signals traverse clothes and reflect off the human body; thus they can be used to extract more persistent human-identifying features like body size and shape. We evaluate the performance of RF-ReID on longitudinal datasets that span days and weeks, where the person may wear different clothes across days. Our experiments demonstrate that RF-ReID outperforms state-of-the-art RGB-based ReID approaches for long-term person ReID. Our results also reveal two interesting properties: First, since RF signals work in the presence of occlusions and poor lighting, RF-ReID enables person ReID in such scenarios. Second, unlike photos and videos, which reveal personal and private information, RF signals are more privacy-preserving and can hence help extend person ReID to privacy-sensitive domains, like healthcare.

* CVPR 2020. The first three authors contributed equally to this paper 

Making the Invisible Visible: Action Recognition Through Walls and Occlusions

Sep 20, 2019
Tianhong Li, Lijie Fan, Mingmin Zhao, Yingcheng Liu, Dina Katabi

Understanding people's actions and interactions typically depends on seeing them. Automating the process of action recognition from visual data has been the topic of much research in the computer vision community. But what if it is too dark, or if the person is occluded or behind a wall? In this paper, we introduce a neural network model that can detect human actions through walls and occlusions, and in poor lighting conditions. Our model takes radio frequency (RF) signals as input, generates 3D human skeletons as an intermediate representation, and recognizes actions and interactions of multiple people over time. By translating the input to an intermediate skeleton-based representation, our model can learn from both vision-based and RF-based datasets, allowing the two tasks to help each other. We show that our model achieves accuracy comparable to vision-based action recognition systems in visible scenarios, yet continues to work accurately when people are not visible, hence addressing scenarios that are beyond the limit of today's vision-based action recognition.
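
Below is a minimal sketch of the two-stage pipeline described above, with an RF-to-skeleton stage followed by a skeleton-only action classifier; all module sizes and layer choices are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch of the RF -> skeleton -> action pipeline. Module sizes and
# layers are illustrative, not the actual model.
import torch
import torch.nn as nn

class RFActionModel(nn.Module):
    def __init__(self, rf_dim=256, num_joints=14, num_actions=10):
        super().__init__()
        # Stage 1: per-frame RF features -> 3D skeleton (num_joints x 3 coordinates).
        self.skeleton_net = nn.Sequential(nn.Linear(rf_dim, 512), nn.ReLU(),
                                          nn.Linear(512, num_joints * 3))
        # Stage 2: skeleton sequence -> action logits. Because this stage only
        # consumes skeletons, it could also be trained on vision-derived skeletons.
        self.action_net = nn.GRU(num_joints * 3, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_actions)

    def forward(self, rf_frames):                      # (batch, time, rf_dim)
        skeletons = self.skeleton_net(rf_frames)       # (batch, time, joints*3)
        _, h = self.action_net(skeletons)              # final hidden state
        return self.classifier(h[-1]), skeletons

model = RFActionModel()
logits, skeletons = model(torch.randn(2, 30, 256))
print(logits.shape, skeletons.shape)
```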

* ICCV 2019. The first two authors contributed equally to this paper 

Knowledge Distillation from Few Samples

Dec 05, 2018
Tianhong Li, Jianguo Li, Zhuang Liu, Changshui Zhang

Current knowledge distillation methods require full training data to distill knowledge from a large "teacher" network to a compact "student" network by matching certain statistics between "teacher" and "student", such as softmax outputs and feature responses. This is not only time-consuming but also inconsistent with human cognition, where children can learn from adults with only a few examples. This paper proposes a novel and simple method for knowledge distillation from few samples. Under the assumption that both "teacher" and "student" have the same feature map sizes at each corresponding block, we add a 1x1 conv-layer at the end of each block in the student-net, and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer from a limited number of samples. We prove that the added layer can be absorbed/merged into the previous conv-layer to form a new conv-layer with the same number of parameters and the same computation cost as the previous one. Experiments verify that the proposed method is efficient and effective at distilling knowledge from teacher-nets to student-nets constructed in different ways, across various datasets.
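
As a concrete illustration of the merge step, here is a minimal sketch assuming plain PyTorch conv layers with no nonlinearity between the block's last conv and the added 1x1 layer; the layer sizes are illustrative, not taken from the paper.

```python
# A minimal sketch of folding an added 1x1 conv into the preceding KxK conv:
# a 1x1 conv is a linear re-mixing of output channels, so its weights can be
# merged into the previous layer. Shapes and layer names are illustrative.
import torch
import torch.nn as nn

def merge_1x1_into_conv(conv, conv1x1):
    """Return a single conv equivalent to conv1x1(conv(x)), with no nonlinearity between."""
    w0, b0 = conv.weight.data, conv.bias.data                   # (C_mid, C_in, k, k), (C_mid,)
    w1 = conv1x1.weight.data.squeeze(-1).squeeze(-1)            # (C_out, C_mid)
    b1 = conv1x1.bias.data                                      # (C_out,)
    merged = nn.Conv2d(conv.in_channels, conv1x1.out_channels,
                       conv.kernel_size, stride=conv.stride, padding=conv.padding)
    merged.weight.data = torch.einsum('oc,cikl->oikl', w1, w0)  # re-mix output channels
    merged.bias.data = w1 @ b0 + b1
    return merged

conv = nn.Conv2d(16, 32, 3, padding=1)
align = nn.Conv2d(32, 32, 1)                  # the added 1x1 alignment layer
merged = merge_1x1_into_conv(conv, align)

x = torch.randn(2, 16, 8, 8)
# Same function as the two-layer composition, with the original conv's size and cost.
print(torch.allclose(align(conv(x)), merged(x), atol=1e-5))
```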
