Boqing Gong

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

Dec 14, 2021

Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown

Abstract: In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) we employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals for the joint training; (ii) we propose a novel gradient masking strategy to balance the parameter updates from the image and text pre-training losses. We evaluate the jointly pre-trained transformer by fine-tuning it on image classification tasks and natural language understanding tasks, respectively. The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
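The two training techniques named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the temperature parameter, the use of a KL divergence for distillation, and the choice of a binary per-parameter mask that selects either the text gradient or the image gradient are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) between softened output distributions.

    Assumption: the teacher (e.g. a separately trained model) supplies
    soft targets and the student is penalized for diverging from them.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def masked_update(grad_image, grad_text, text_mask):
    """Combine two gradients for the shared parameters.

    Assumption: text_mask[i] == 1 means parameter i is updated from the
    text loss, otherwise from the image loss, so the two losses never
    compete for the same parameter in a single step.
    """
    return [gt if m else gi
            for gi, gt, m in zip(grad_image, grad_text, text_mask)]
```

For example, `masked_update([1, 2, 3], [10, 20, 30], [1, 0, 1])` takes the first and third entries from the text gradient and the second from the image gradient.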

* preliminary work

Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

Dec 09, 2021

Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Abstract: A modern self-supervised learning algorithm typically enforces persistency of the representations of an instance across views. While very effective for learning holistic image and video representations, such an approach becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present the Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) framework to effectively learn spatio-temporally fine-grained representations using self-supervision. We first design a region-based self-supervised pretext task which requires the model to learn to transform instance representations from one view to another guided by context features. Further, we introduce a simple network design that effectively reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and ConST-CL achieves state-of-the-art results on four datasets. For spatio-temporal action localization, ConST-CL achieves 39.4% mAP with ground-truth boxes and 30.5% mAP with detected boxes on the AVA-Kinetics validation set. For object tracking, ConST-CL achieves 78.1% precision and 55.2% success scores on OTB2015. Furthermore, ConST-CL achieves 94.8% and 71.9% top-1 fine-tuning accuracy on the video action recognition datasets UCF101 and HMDB51, respectively. We plan to release our code and models to the public.

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

Dec 08, 2021

Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

Abstract: This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
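The clip sampling step described in the abstract (a short clip that lies inside a long clip) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the frame-index representation, and uniform sampling of the start positions are hypothetical and are not taken from the paper.

```python
import random


def sample_nested_clips(num_frames, long_len, short_len, rng=None):
    """Return ((long_start, long_end), (short_start, short_end)).

    Ranges are half-open frame-index intervals. The short clip is
    always fully contained in the long clip.
    """
    if not (0 < short_len <= long_len <= num_frames):
        raise ValueError("need 0 < short_len <= long_len <= num_frames")
    rng = rng or random.Random()
    # randint is inclusive on both ends, so every valid start is reachable.
    long_start = rng.randint(0, num_frames - long_len)
    short_start = rng.randint(long_start, long_start + long_len - short_len)
    return ((long_start, long_start + long_len),
            (short_start, short_start + short_len))
```

Because the short clip's frame range is a sub-range of the long clip's, each temporal position in the short clip has a corresponding position in the long clip, which is what a fine-grained matching objective would compare.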

Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Model Training

Sep 18, 2021

Zihang Zou, Boqing Gong, Liqiang Wang

Abstract: In this paper, we raise an emerging personal data protection problem where user personal data (e.g., images) could be inappropriately exploited to train deep neural network models without authorization. To solve this problem, we revisit traditional watermarking in advanced machine learning settings. By embedding a watermarking signature into user images using a specialized linear color transformation, neural models will be imprinted with such a signature if the training data include watermarked images. Then, a third-party verifier can verify potential unauthorized usage by inferring the watermark signature from neural models. We further explore the desired properties of watermarking and signature space for convincing verification. Through extensive experiments, we show empirically that linear color transformation is effective in protecting users' personal images in various realistic settings. To the best of our knowledge, this is the first work to protect users' personal data from unauthorized usage in neural network training.
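As a rough illustration of what a linear color transformation on an image looks like, the sketch below applies a 3x3 matrix and a bias to every RGB pixel. The abstract does not specify the form of the "specialized" transformation or how the signature is encoded, so treating the matrix and bias as the signature, the function names, and the [0, 1] pixel range are all assumptions.

```python
def apply_color_transform(pixel, matrix, bias):
    """Map one (r, g, b) pixel through out = matrix @ pixel + bias.

    Values are assumed to lie in [0, 1] and are clipped back to that range.
    """
    r, g, b = pixel
    out = []
    for row, offset in zip(matrix, bias):
        value = row[0] * r + row[1] * g + row[2] * b + offset
        out.append(min(1.0, max(0.0, value)))
    return tuple(out)


def watermark_image(image, matrix, bias):
    """Apply the same transform to every pixel of a row-major image."""
    return [[apply_color_transform(p, matrix, bias) for p in row]
            for row in image]
```

With the identity matrix and a zero bias the image is returned unchanged; a hypothetical signature would be a small, deliberate departure from that identity.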

Federated Multi-Target Domain Adaptation

Aug 17, 2021

Chun-Han Yao, Boqing Gong, Yin Cui, Hang Qi, Yukun Zhu, Ming-Hsuan Yang

Abstract: Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy. However, it is not always feasible to obtain high-quality supervisory signals from users, especially for vision tasks. Unlike typical federated settings with labeled client data, we consider a more practical scenario where the distributed client data is unlabeled, and a centralized labeled dataset is available on the server. We further take the server-client and inter-client domain shifts into account and pose a domain adaptation problem with one source (centralized server data) and multiple targets (distributed client data). Within this new Federated Multi-Target Domain Adaptation (FMTDA) task, we analyze the model performance of existing domain adaptation methods and propose an effective DualAdapt method to address the new challenges. Extensive experimental results on image classification and semantic segmentation tasks demonstrate that our method achieves high accuracy, incurs minimal communication cost, and requires low computational resources on client devices.

On Model Calibration for Long-Tailed Object Detection and Instance Segmentation

Jul 05, 2021

Tai-Yu Pan, Cheng Zhang, Yandong Li, Hexiang Hu, Dong Xuan, Soravit Changpinyo, Boqing Gong, Wei-Lun Chao

Abstract: Vanilla models for object detection and instance segmentation suffer from the heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or re-weighting. In this paper, we investigate a largely overlooked approach -- post-processing calibration of confidence scores. We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NorCal can effectively improve nearly all the baseline models not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into various modeling choices and mechanisms of our approach.
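The recipe described in the abstract (down-weight each class score by its training sample size, leave the background score alone, then normalize per proposal) can be sketched as below. The exact formula is not given in the abstract, so the power-law factor `count ** gamma`, the parameter name `gamma`, and the function name are assumptions made for illustration.

```python
def calibrate_scores(fg_scores, bg_score, class_counts, gamma=1.0):
    """Post-hoc calibration of one proposal's class scores.

    fg_scores:    predicted scores for the foreground classes
    bg_score:     predicted score for the background class (not reweighted)
    class_counts: number of training samples per foreground class
    gamma:        assumed strength of the reweighting; 0 disables it
    """
    if len(fg_scores) != len(class_counts):
        raise ValueError("one count is required per foreground class")
    # Frequent classes are divided by a larger factor than rare ones.
    scaled = [s / (n ** gamma) for s, n in zip(fg_scores, class_counts)]
    # Normalize over all classes, with the background kept as-is.
    denom = sum(scaled) + bg_score
    return [s / denom for s in scaled]
```

With scores `[0.6, 0.2]`, a background score of `0.2`, and counts `[100, 1]`, the rare second class ends up with the higher calibrated score, which is the intended effect in a long-tailed setting.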

Bridging the Gap Between Object Detection and User Intent via Query-Modulation

Jun 18, 2021

Marco Fornoni, Chaochao Yan, Liangchen Luo, Kimberly Wilber, Alex Stark, Yin Cui, Boqing Gong, Andrew Howard

Abstract: When interacting with objects through cameras or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper, we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Jun 03, 2021

Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

Abstract: Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
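The general idea of a sharpness-aware update can be sketched on a toy problem: take the gradient at the current weights, step a small distance `rho` in that direction to reach a nearby "worst-case" point, and then update the original weights with the gradient measured there. This is a generic, minimal illustration on plain Python lists, not the optimizer configuration used in the paper; the function name, `lr`, `rho`, and the toy gradient function in the usage note are assumptions.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware gradient step.

    weights: list of floats
    grad_fn: callable returning the loss gradient at a list of weights
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Already at a stationary point; nothing to perturb toward.
        return list(weights)
    # Ascent step to a nearby point with (approximately) higher loss.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Descent step from the ORIGINAL weights, using the perturbed gradient.
    grad_perturbed = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, grad_perturbed)]
```

For the toy loss f(w) = w_1^2 + w_2^2, whose gradient is `lambda w: [2 * x for x in w]`, one call starting from `[1.0, 0.0]` with the default settings moves the first weight to about 0.79.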

Adversarially Adaptive Normalization for Single Domain Generalization

Jun 01, 2021

Xinjie Fan, Qifei Wang, Junjie Ke, Feng Yang, Boqing Gong, Mingyuan Zhou

Abstract: Single domain generalization aims to learn a model that performs well on many unseen domains with data from only one domain for training. Existing works focus on studying adversarial domain augmentation (ADA) to improve the model's generalization capability. The impact of the statistics of normalization layers on domain generalization is still underinvestigated. In this paper, we propose a generic normalization approach, adaptive standardization and rescaling normalization (ASR-Norm), to complement the missing part in previous works. ASR-Norm learns both the standardization and rescaling statistics via neural networks. This new form of normalization can be viewed as a generic form of the traditional normalizations. When trained with ADA, the statistics in ASR-Norm are learned to be adaptive to the data coming from different domains, and hence improve the model's generalization performance across domains, especially on target domains with a large discrepancy from the source domain. The experimental results show that ASR-Norm can bring consistent improvement to the state-of-the-art ADA approaches, by 1.6%, 2.7%, and 6.3% on average on the Digits, CIFAR-10-C, and PACS benchmarks, respectively. As a generic tool, the improvement introduced by ASR-Norm is agnostic to the choice of ADA methods.

* CVPR 2021

2.5D Visual Relationship Detection

Apr 26, 2021

Yu-Chuan Su, Soravit Changpinyo, Xiangning Chen, Sathish Thoppay, Cho-Jui Hsieh, Lior Shapira, Radu Soricut, Hartwig Adam, Matthew Brown, Ming-Hsuan Yang (+1 more)

Abstract: Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics. To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is egocentric, using the camera's viewpoint as a common reference for all 2.5D relationships. Unlike depth estimation, 2.5VRD is object-centric and does not focus only on depth. To enable progress on this task, we create a new dataset consisting of 220K human-annotated 2.5D relationships among 512K objects from 11K images. We analyze this dataset and conduct extensive experiments including benchmarking multiple state-of-the-art VRD models on this task. Our results show that existing models largely rely on semantic cues and simple heuristics to solve 2.5VRD, motivating further research on models for 2.5D perception. The new dataset is available at https://github.com/google-research-datasets/2.5vrd.
