Xinlong Wang

Generative Pretraining in Multimodality

Jul 11, 2023
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved images, text, and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, which together with text tokens form an interleaved input sequence. Emu is then trained end-to-end with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks, including image captioning, visual question answering, video question answering, and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities, such as multimodal assistants built via instruction tuning, are also demonstrated with impressive performance.
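
To make the unified objective concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a loss that classifies the next text token and regresses the next visual embedding over an interleaved sequence; the tensor names, shapes, and the mean-squared regression term are assumptions for illustration only.

```python
# Hedged sketch of a unified next-element objective over an interleaved
# multimodal sequence: cross-entropy at text positions, regression at
# visual-embedding positions. All names and shapes are illustrative.
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(text_logits, text_targets, is_text,
                                visual_preds, visual_targets, is_visual):
    """text_logits: (B, L, V) next-token distributions; text_targets: (B, L) token ids;
    visual_preds / visual_targets: (B, L, D) predicted / target visual embeddings;
    is_text / is_visual: (B, L) booleans marking which positions predict text / visuals."""
    ce = F.cross_entropy(text_logits[is_text], text_targets[is_text])     # classification term
    reg = F.mse_loss(visual_preds[is_visual], visual_targets[is_visual])  # regression term
    return ce + reg

# Toy usage with random tensors and a random partition of positions.
B, L, V, D = 2, 16, 1000, 64
is_text = torch.rand(B, L) > 0.5
loss = unified_autoregressive_loss(
    torch.randn(B, L, V), torch.randint(V, (B, L)), is_text,
    torch.randn(B, L, D), torch.randn(B, L, D), ~is_text,
)
```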

* Code and Demo: https://github.com/baaivision/Emu 

Fine-Grained Visual Prompting

Jun 07, 2023
Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs remain largely unexplored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, is exceptionally effective. This prompting strategy leverages precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Part detection experiments conducted on the PACO dataset further validate the superiority of FGVP over existing visual prompting techniques. Code and models will be made available.
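
As a concrete illustration of the Blur Reverse Mask idea, the sketch below blurs an image everywhere outside a binary target mask while keeping the masked region sharp; the function name, blur radius, and use of PIL are assumptions, not the paper's released implementation.

```python
# Hedged sketch: blur the image outside a target mask, keep the target sharp.
import numpy as np
from PIL import Image, ImageFilter

def blur_reverse_mask(image: Image.Image, mask: np.ndarray, radius: float = 10.0) -> Image.Image:
    """image: RGB PIL image; mask: HxW boolean array, True inside the target region."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    mask3 = np.repeat(mask[..., None], 3, axis=2)                  # HxWx3 boolean
    out = np.where(mask3, np.asarray(image), np.asarray(blurred))  # sharp inside, blurred outside
    return Image.fromarray(out.astype(np.uint8))

# Example: keep a central box sharp and blur the rest before feeding the image to a VLM.
img = Image.new("RGB", (224, 224), color=(120, 180, 200))
m = np.zeros((224, 224), dtype=bool)
m[64:160, 64:160] = True
prompted = blur_reverse_mask(img, m)
```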

Towards Better Entity Linking with Multi-View Enhanced Distillation

May 27, 2023
Yi Liu, Yuan Tian, Jianxun Lian, Xinlong Wang, Yanan Cao, Fang Fang, Wen Zhang, Haizhen Huang, Denvy Deng, Qi Zhang

Dense retrieval is widely used for entity linking to retrieve entities from large-scale knowledge bases. Mainstream techniques are based on a dual-encoder framework, which encodes mentions and entities independently and calculates their relevance via coarse interaction metrics, making it difficult to explicitly model the multiple mention-relevant parts within an entity so as to match divergent mentions. Aiming at learning entity representations that can match divergent mentions, this paper proposes a Multi-View Enhanced Distillation (MVD) framework, which effectively transfers knowledge of multiple fine-grained and mention-relevant parts within entities from cross-encoders to dual-encoders. Each entity is split into multiple views to avoid irrelevant information being over-squashed into the mention-relevant view. We further design cross-alignment and self-alignment mechanisms for this framework to facilitate fine-grained knowledge distillation from the teacher model to the student model. Meanwhile, we reserve a global view that embeds the entity as a whole to prevent dispersal of uniform information. Experiments show our method achieves state-of-the-art performance on several entity linking benchmarks.
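
The sketch below illustrates one plausible form of view-level distillation in this spirit: a cross-encoder teacher's scores over entity views supervise a dual-encoder student through a KL loss. The function, tensor shapes, and temperature are hypothetical and not taken from the MVD code.

```python
# Hedged sketch of view-level knowledge distillation from a cross-encoder
# teacher to a dual-encoder student. All names and shapes are illustrative.
import torch
import torch.nn.functional as F

def view_distillation_loss(student_mention_emb, student_view_embs,
                           teacher_view_scores, temperature=1.0):
    """student_mention_emb: (B, D); student_view_embs: (B, V, D), one embedding per entity view;
    teacher_view_scores: (B, V) cross-encoder relevance scores for the same views."""
    # Student relevance: dot product between the mention and each entity view.
    student_scores = torch.einsum("bd,bvd->bv", student_mention_emb, student_view_embs)
    teacher_probs = F.softmax(teacher_view_scores / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_scores / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")

loss = view_distillation_loss(torch.randn(4, 128), torch.randn(4, 8, 128), torch.randn(4, 8))
```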

* Accepted by ACL 2023 Main Conference 

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

May 22, 2023
Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen

Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. Even though individual models have limited capabilities, properly combining multiple such models can lead to positive synergies and unleash their full potential. In this work, we present Matcher, which segments anything with one shot by integrating an all-purpose feature extraction model and a class-agnostic segmentation model. Naively connecting the models results in unsatisfactory performance; for example, the models tend to generate matching outliers and false-positive mask fragments. To address these issues, we design a bidirectional matching strategy for accurate cross-image semantic dense matching and a robust prompt sampler for mask proposal generation. In addition, we propose a novel instance-level matching strategy for controllable mask merging. The proposed Matcher delivers impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ for one-shot semantic segmentation, surpassing the state-of-the-art specialist model by 1.6%. In addition, our visualization results show open-world generality and flexibility on images in the wild. The code will be released at https://github.com/aim-uofa/Matcher.
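
As an illustration of bidirectional matching, the sketch below keeps only mutual nearest-neighbour matches between reference and target patch features under cosine similarity; the names and shapes are assumptions rather than the released Matcher implementation.

```python
# Hedged sketch of bidirectional (mutual nearest-neighbour) feature matching.
import torch
import torch.nn.functional as F

def mutual_nearest_neighbours(ref_feats, tgt_feats):
    """ref_feats: (N, D) features of the reference (masked) region;
    tgt_feats: (M, D) features of the target image. Returns index pairs (i, j)
    that are each other's nearest neighbour under cosine similarity."""
    sim = F.normalize(ref_feats, dim=-1) @ F.normalize(tgt_feats, dim=-1).T  # (N, M)
    nn_rt = sim.argmax(dim=1)          # best target patch for each reference patch
    nn_tr = sim.argmax(dim=0)          # best reference patch for each target patch
    idx = torch.arange(ref_feats.size(0))
    mutual = nn_tr[nn_rt] == idx       # keep matches that agree in both directions
    return idx[mutual], nn_rt[mutual]

ref_idx, tgt_idx = mutual_nearest_neighbours(torch.randn(196, 768), torch.randn(196, 768))
```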

* Technical Report 

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Apr 13, 2023
Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen

Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, how to extend this success to video editing remains unclear. Recent initial attempts at video editing require significant text-to-video data and computational resources for training, which are often inaccessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. vid2vid-zero leverages off-the-shelf image diffusion models and does not require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code is made available at https://github.com/baaivision/vid2vid-zero.
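
The sketch below shows one simplified way to realize bi-directional cross-frame attention at test time, with every frame's queries attending to keys and values gathered from all frames; the tensor shapes and the absence of projection layers are simplifying assumptions, not the vid2vid-zero code.

```python
# Hedged sketch of bi-directional cross-frame attention across a short clip.
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (T, N, D) per-frame queries/keys/values, T frames of N tokens each."""
    T, N, D = q.shape
    k_all = k.reshape(1, T * N, D).expand(T, T * N, D)  # every frame sees all frames' keys
    v_all = v.reshape(1, T * N, D).expand(T, T * N, D)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / D ** 0.5, dim=-1)  # (T, N, T*N)
    return attn @ v_all                                                 # (T, N, D)

out = cross_frame_attention(torch.randn(8, 64, 32), torch.randn(8, 64, 32), torch.randn(8, 64, 32))
```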

* Add appendix 

SegGPT: Segmenting Everything In Context

Apr 06, 2023
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang

We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same image format. The training of SegGPT is formulated as an in-context coloring problem with a random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text segmentation. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of-domain targets, both qualitatively and quantitatively.
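
The sketch below illustrates the random color mapping idea: each training sample recolors its class ids with freshly sampled colors, so the model must infer the task from the context rather than memorize fixed class colors. The function and palette scheme are illustrative assumptions, not the released SegGPT code.

```python
# Hedged sketch of per-sample random colour mapping for in-context colouring.
import numpy as np

def random_color_mapping(label_map: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """label_map: HxW integer class ids. Returns an HxWx3 uint8 colouring target."""
    ids = np.unique(label_map)
    palette = {i: rng.integers(0, 256, size=3, dtype=np.uint8) for i in ids}  # fresh colours
    out = np.zeros((*label_map.shape, 3), dtype=np.uint8)
    for i in ids:
        out[label_map == i] = palette[i]
    return out

rng = np.random.default_rng(0)
target = random_color_mapping(np.random.randint(0, 5, size=(64, 64)), rng)
```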

* Code and Demo: https://github.com/baaivision/Painter 

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Mar 27, 2023
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly lower training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
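
For background, the sketch below shows the standard symmetric image-text contrastive objective that CLIP-style models such as EVA-CLIP build on; the paper's specific representation-learning, optimization, and augmentation improvements are not reproduced here, and the temperature value is illustrative.

```python
# Minimal sketch of the standard symmetric image-text contrastive loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) paired embeddings for a batch of image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(image_emb.size(0))            # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```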

* To Rei and the moon. Code & Models: https://github.com/baaivision/EVA/tree/master/EVA-CLIP 

EVA-02: A Visual Representation for Neon Genesis

Mar 22, 2023
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture and extensive pre-training from an open and accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while using significantly fewer parameters and less compute. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, our EVA-02-CLIP reaches up to 80.4 zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest and best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.
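
The sketch below illustrates a masked-image-modeling objective of this kind, where a student reconstructs a frozen CLIP teacher's patch features at masked positions; the cosine-style loss, tensor names, and masking scheme are assumptions used only to convey the idea.

```python
# Hedged sketch of masked image modeling against frozen CLIP feature targets.
import torch
import torch.nn.functional as F

def mim_feature_loss(student_feats, teacher_feats, masked):
    """student_feats, teacher_feats: (B, N, D) patch features; masked: (B, N) booleans
    marking which patches were masked out of the student's input."""
    s = F.normalize(student_feats[masked], dim=-1)
    t = F.normalize(teacher_feats[masked], dim=-1)
    # Negative cosine similarity between student predictions and teacher targets.
    return 1.0 - (s * t).sum(dim=-1).mean()

loss = mim_feature_loss(torch.randn(2, 196, 768), torch.randn(2, 196, 768), torch.rand(2, 196) > 0.4)
```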

* v2: Fix some known issues & typos. v1: To Asuka. Code & Models: https://github.com/baaivision/EVA/tree/master/EVA-02 