Hangbo Bao

A Unified View of Masked Image Modeling

Oct 19, 2022
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioned on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable to or better than state-of-the-art methods. Using a huge vision Transformer pretrained for 300 epochs, MaskDistill obtains 88.3% top-1 fine-tuning accuracy on ImageNet-1k (224 resolution) and 58.8% mIoU for semantic segmentation on ADE20k (512 resolution). The code and pretrained models will be available at https://aka.ms/unimim.
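
A minimal sketch of the masked-feature-distillation objective described in the abstract, assuming a student and a teacher that both return per-patch features; the module names, mask-token handling, and the Smooth-L1 choice are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def maskdistill_loss(student, teacher, mask_token, patches, mask):
    """patches: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked position."""
    with torch.no_grad():
        target = teacher(patches)                              # (B, N, C) teacher features
        target = F.layer_norm(target, target.shape[-1:])       # per-patch normalization
    # Replace masked patches with a learnable mask token to form the corrupted input.
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)
    pred = student(corrupted)                                  # (B, N, C) student predictions
    # Regress the normalized teacher features only at masked positions
    # (Smooth-L1 is an assumed choice of regression loss).
    return F.smooth_l1_loss(pred[mask], target[mask])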


Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Aug 31, 2022
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce BEiT-3, a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
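
The Multiway Transformer mentioned above pairs a shared self-attention layer with modality-specific feed-forward experts. The block below is a hypothetical minimal version of that idea; dimensions, routing, and names are assumptions, not the released BEiT-3 code.

import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim=768, heads=12, num_modalities=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared attention
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleList(                                     # per-modality FFN
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_modalities)
        )

    def forward(self, x, modality_ids):
        """x: (B, N, D); modality_ids: (B, N) long tensor, e.g. 0 = image, 1 = text."""
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            sel = modality_ids == m
            out[sel] = expert(h[sel])          # each token goes through its modality's expert
        return x + out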

* 18 pages 

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Aug 12, 2022
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most methods still operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this study, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from the pixel level to the semantic level. Specifically, we introduce vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space into compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Moreover, we encourage the model to explicitly aggregate patch information into a global image representation, which facilitates linear probing. Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods. On ImageNet-1K (224 resolution), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 resolution) fine-tuning and 56.7% mIoU on ADE20K for semantic segmentation. The code and pretrained models are available at https://aka.ms/beit.
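
A rough sketch of the vector-quantized knowledge distillation recipe described above, under assumed interfaces (encoder, decoder, teacher are placeholder callables, and the loss weighting is illustrative); this is not the official BEiT v2 code. Patch features are snapped to their nearest codebook entry, and the decoder learns to reproduce the teacher's semantic features from the quantized codes.

import torch
import torch.nn.functional as F

def vqkd_step(encoder, decoder, teacher, codebook, images):
    z = encoder(images)                                        # (B, N, D) patch features
    # Nearest-neighbor lookup in the codebook (V, D) by L2 distance.
    dist = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    idx = dist.argmin(dim=-1)                                  # (B, N) discrete visual tokens
    q = codebook[idx]                                          # quantized features
    q = z + (q - z).detach()                                   # straight-through gradient
    rec = decoder(q)                                           # reconstructed semantics
    with torch.no_grad():
        target = teacher(images)                               # semantic teacher features
    distill = 1 - F.cosine_similarity(rec, target, dim=-1).mean()
    commit = F.mse_loss(z, codebook[idx].detach())             # pull encoder toward the codes
    return distill + commit, idx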


VL-BEiT: Generative Vision-Language Pretraining

Jun 02, 2022
Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei

We introduce a vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
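
Because VL-BEiT uses one pretraining task for all data types, a training step can be written as a single masked-prediction loss regardless of whether the batch is image-only, text-only, or an image-text pair. A schematic under assumed batch and model interfaces (not the released code):

import torch.nn.functional as F

def vlbeit_step(backbone, head, batch):
    """batch["tokens"]: masked input tokens/patches; batch["mask"]: (B, N) bool of masked
    positions; batch["targets"]: discrete targets (word ids or visual tokens) at those
    positions. The same objective applies to images, texts, and image-text pairs."""
    hidden = backbone(batch["tokens"])               # one shared Transformer
    logits = head(hidden[batch["mask"]])             # predict only the masked positions
    return F.cross_entropy(logits, batch["targets"])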


THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption

Jun 02, 2022
Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei

As more and more pre-trained language models adopt on-cloud deployment, privacy concerns grow quickly, mainly due to the exposure of plain-text user data (e.g., search history, medical records, bank accounts). Privacy-preserving inference of transformer models is in demand among cloud service users. To protect privacy, an attractive choice is to compute only on ciphertext with homomorphic encryption (HE). However, enabling inference of pre-trained models on ciphertext data is difficult due to the complex computations in transformer blocks, which are not yet supported by current HE tools. In this work, we introduce $\textit{THE-X}$, an approximation approach for transformers that enables privacy-preserving inference of pre-trained models developed with popular frameworks. $\textit{THE-X}$ proposes a workflow to handle the complex computations in transformer networks, including all the non-polynomial functions such as GELU, softmax, and LayerNorm. Experiments reveal that our proposed $\textit{THE-X}$ can enable transformer inference on encrypted data for different downstream tasks, all with a negligible performance drop while enjoying the theory-guaranteed privacy-preserving advantage.
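
Homomorphic encryption evaluates additions and multiplications, so the non-polynomial pieces of a Transformer must be replaced with approximations. The sketch below only illustrates that general pattern with a hypothetical quadratic GELU substitute and a module-swapping helper; the actual approximations of GELU, softmax, and LayerNorm are the ones described in the THE-X paper, not these.

import torch.nn as nn

class PolyGELU(nn.Module):
    # Degree-2 polynomial stand-in for GELU (illustrative coefficients only): an HE
    # ciphertext supports only additions and multiplications, so activations must be
    # expressed in that form.
    def forward(self, x):
        return 0.125 * x * x + 0.25 * x + 0.5

def make_he_friendly(module: nn.Module) -> nn.Module:
    # Recursively swap GELU activations for the polynomial surrogate; softmax and
    # LayerNorm need analogous replacements (see the paper for the actual workflow).
    for name, child in module.named_children():
        if isinstance(child, nn.GELU):
            setattr(module, name, PolyGELU())
        else:
            make_he_friendly(child)
    return module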

* Findings of ACL 2022 

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

Feb 07, 2022
Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, Furu Wei

We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of using artificial mask tokens, CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image: some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns either to recover all the original image pixels or to predict whether each visual token was replaced by a generator sample or not. The generator and the enhancer are simultaneously trained and synergistically updated. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that is suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experimental results show that our approach achieves compelling results on vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation. For example, 300-epoch CIM pre-trained vanilla ViT-Base/16 and ResNet-50 obtain 83.3% and 80.6% top-1 fine-tuning accuracy on ImageNet-1K image classification, respectively.
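
One way to read the pipeline above: the generator fills masked positions with plausible samples, and the enhancer is trained (here in the replaced-patch-detection variant) to spot which patches were swapped. The sketch below assumes hypothetical generator/enhancer interfaces and is not the official CIM code.

import torch
import torch.nn.functional as F

def cim_step(generator, enhancer, patches, mask):
    """patches: (B, N, D) patch embeddings; mask: (B, N) bool positions to corrupt."""
    with torch.no_grad():
        logits = generator(patches, mask)                       # (B, N, V) over visual tokens
        sampled = torch.distributions.Categorical(logits=logits).sample()
        plausible = generator.decode(sampled)                   # (B, N, D) proposed patches
    # Build the corrupted image: generator samples at masked positions, originals elsewhere.
    corrupted = torch.where(mask.unsqueeze(-1), plausible, patches)
    pred = enhancer(corrupted)                                  # (B, N) replacement logits
    labels = mask.float()                                       # 1 where a patch was swapped
    return F.binary_cross_entropy_with_logits(pred, labels)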

* Preprint. Work in progress. Code will be released at https://aka.ms/beit-cim 

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Nov 03, 2021
Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce the Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data in addition to image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA and NLVR2. The code and pretrained models are available at https://aka.ms/vlmo.
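
The dual-encoder versus fusion-encoder distinction above determines how a pretrained VLMo is used downstream. A hypothetical usage sketch (encode_image, encode_text, encode_pair, and cls_head are assumed method names, not the released API):

import torch.nn.functional as F

def retrieval_scores(vlmo, images, texts):
    # Dual-encoder mode: independent encodings, cheap similarity scoring for retrieval.
    img = F.normalize(vlmo.encode_image(images), dim=-1)    # (B, D)
    txt = F.normalize(vlmo.encode_text(texts), dim=-1)      # (B, D)
    return img @ txt.t()                                     # (B, B) similarity matrix

def classification_logits(vlmo, images, texts):
    # Fusion-encoder mode: joint encoding with cross-modal attention, then a task head.
    fused = vlmo.encode_pair(images, texts)                  # (B, D) pooled feature
    return vlmo.cls_head(fused)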

* Work in progress 

s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning

Oct 26, 2021
Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, Furu Wei

Pretrained bidirectional Transformers, such as BERT, have achieved significant improvements on a wide variety of language understanding tasks, but it is not straightforward to apply them directly to natural language generation. In this paper, we present s2s-ft, a sequence-to-sequence fine-tuning toolkit that adopts pretrained Transformers for conditional generation tasks. Inspired by UniLM, we implement three sequence-to-sequence fine-tuning algorithms, namely causal fine-tuning, masked fine-tuning, and pseudo-masked fine-tuning. By leveraging existing pretrained bidirectional Transformers, s2s-ft achieves strong performance on several benchmarks for abstractive summarization and question generation. Moreover, the s2s-ft package supports both monolingual and multilingual NLG tasks. The s2s-ft toolkit is available at https://github.com/microsoft/unilm/tree/master/s2s-ft.
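
The UniLM-style fine-tuning algorithms mentioned above hinge on an attention mask that lets source tokens attend bidirectionally while target tokens attend only to the source and the already-generated prefix. A small illustrative sketch of such a mask (not the toolkit's code):

import torch

def s2s_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Returns an (L, L) mask with 1 where attention is allowed, L = src_len + tgt_len."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total)
    mask[:, :src_len] = 1                                   # every token sees the full source
    causal = torch.tril(torch.ones(tgt_len, tgt_len))       # left-to-right target prefix
    mask[src_len:, src_len:] = causal
    return mask

# Example: 3 source tokens, 2 target tokens.
print(s2s_attention_mask(3, 2))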

* Demo paper for the s2s-ft toolkit: https://github.com/microsoft/unilm/tree/master/s2s-ft 