Liunian Harold Li

DesCo: Learning Object Recognition with Rich Language Descriptions

Jun 24, 2023
Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang

Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and improve the models' adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that include specifications of fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle these challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions, consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
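
As a rough illustration of innovation 1), here is a minimal sketch (not the authors' code) of prompting an off-the-shelf language model to expand an object name and raw caption into a richer description; the model choice and prompt wording are assumptions made for illustration.

```python
# Minimal sketch: an LLM as a "description engine", in the spirit of DesCo's first step.
# GPT-2 stands in for a much larger LLM; the prompt template is illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def describe(object_name: str, caption: str) -> str:
    prompt = (
        f"Image caption: {caption}\n"
        f"Describe the {object_name} in the image, including its attributes, "
        f"shape, texture, and relations to nearby objects:\n"
    )
    out = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)[0]["generated_text"]
    return out[len(prompt):].strip()

print(describe("cat", "a cat sleeping on a red sofa next to a lamp"))
```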

Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

Jun 24, 2023
Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi

Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalizations for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, especially on challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher's, despite having orders of magnitude fewer parameters. We test several hypotheses regarding which properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.
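
A minimal sketch of the sampling step (assumptions: GPT-2 stands in for the much larger teacher, and the prompt format is invented for illustration) showing how several rationales per instance could be collected to build a distillation corpus for a small student:

```python
# Sample multiple chain-of-thought rationales per question from a "teacher" LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")

def sample_rationales(question: str, n: int = 5) -> list[str]:
    prompt = f"Q: {question}\nA: Let's think step by step."
    inputs = tok(prompt, return_tensors="pt")
    outs = teacher.generate(
        **inputs, do_sample=True, top_p=0.95, max_new_tokens=64,
        num_return_sequences=n, pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]

# Each (question, rationale) pair would then be used to fine-tune a small student model.
corpus = [(q, r) for q in ["Can a dog fit in a matchbox?"] for r in sample_rationales(q)]
```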

* ACL 2023 

MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

Jun 02, 2023
Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F. Yang, Kai-Wei Chang

Large-scale language models have shown the ability to adapt to a new task via conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the VL domain? Specifically, we first meta-train a language model to perform in-context learning on NLP tasks (as in MetaICL); we then transfer this model to perform VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves the in-context learning capability on VL tasks and can even compensate significantly for model size. On VQA, OK-VQA, and GQA, our method outperforms the baseline model while having 20 times fewer parameters.
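
A minimal sketch of the "attach a visual encoder" idea (an illustration, not the paper's implementation): visual features are projected into the embedding space of a language model so that image tokens can be interleaved with in-context text demonstrations. GPT-2 and CLIP's vision tower are stand-ins for the meta-trained LM and the visual encoder.

```python
import torch
from transformers import AutoModelForCausalLM, CLIPVisionModel

lm = AutoModelForCausalLM.from_pretrained("gpt2")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Linear projection from the vision hidden size to the LM embedding size.
proj = torch.nn.Linear(vision.config.hidden_size, lm.config.n_embd)

def visual_prefix(pixel_values: torch.Tensor) -> torch.Tensor:
    """Encode images and project patch features into LM token embeddings."""
    feats = vision(pixel_values=pixel_values).last_hidden_state  # (B, patches, 768)
    return proj(feats)                                           # (B, patches, n_embd)

# The projected "visual tokens" would be concatenated with the embedded text of the
# in-context demonstrations and fed to the (frozen) meta-trained LM.
```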

GLIPv2: Unifying Localization and Vision-Language Understanding

Jun 12, 2022
Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
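
A minimal sketch of a region-word contrastive objective of the general kind described above, written as a symmetric InfoNCE loss; the shapes, the one-to-one region-word matching, and the temperature are illustrative assumptions, not GLIPv2's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(regions: torch.Tensor, words: torch.Tensor, tau: float = 0.07):
    """regions: (N, d) region features; words: (N, d) features of their matched words."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    logits = regions @ words.t() / tau          # (N, N) region-word similarity matrix
    targets = torch.arange(regions.size(0))
    # Symmetric cross-entropy: match each region to its word and each word to its region.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = region_word_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```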

* Code will be released at https://github.com/microsoft/GLIP 

DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation

May 25, 2022
Jingnong Qu, Liunian Harold Li, Jieyu Zhao, Sunipa Dev, Kai-Wei Chang

Disinformation has become a serious problem on social media. In particular, given their short format, visual attraction, and humorous nature, memes spread easily among online communities, making them an effective vehicle for disinformation. We present DisinfoMeme to help detect disinformation memes. The dataset contains memes mined from Reddit covering three current topics: the COVID-19 pandemic, the Black Lives Matter movement, and veganism/vegetarianism. The dataset poses multiple unique challenges: limited data and label imbalance, reliance on external knowledge, multimodal reasoning, layout dependency, and noise from OCR. We test multiple widely-used unimodal and multimodal models on this dataset. The experiments show that current models still leave substantial room for improvement.
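
A minimal sketch of the kind of multimodal baseline such a dataset is meant to stress (not from the paper): a CLIP image embedding is fused with an embedding of the meme's OCR text and fed to a small classifier. The model choice and two-class head are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
classifier = torch.nn.Linear(clip.config.projection_dim * 2, 2)  # {not disinfo, disinfo}

def score(image: Image.Image, ocr_text: str) -> torch.Tensor:
    inputs = processor(text=[ocr_text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    return classifier(torch.cat([img, txt], dim=-1))  # classification logits
```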

On the Paradox of Learning to Reason from Data

May 24, 2022
Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, Guy Van den Broeck

Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples while failing to generalize to other data distributions over the exact same problem space. Our study provides an explanation for this paradox: instead of learning to emulate the correct reasoning function, BERT has in fact learned statistical features that inherently exist in logical reasoning problems. We also show that it is infeasible to jointly remove statistical features from data, illustrating the difficulty of learning to reason in general. Our result naturally extends to other neural models and unveils the fundamental difference between learning to reason and learning to achieve high performance on NLP benchmarks using statistical features.
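
A minimal sketch (illustrative, not the paper's generator) of a confined problem space of this kind: propositional facts and rules are verbalized in natural language, and the gold label follows from forward chaining, so a model like BERT can be trained end-to-end on the examples. Predicates and templates are invented for illustration.

```python
import random

PREDICATES = ["kind", "quiet", "smart", "happy", "brave"]

def make_example(n_rules: int = 3):
    facts = set(random.sample(PREDICATES, 2))
    rules = [(random.choice(PREDICATES), random.choice(PREDICATES)) for _ in range(n_rules)]
    # Forward chaining: repeatedly apply "if X then Y" rules to the known facts.
    derived, changed = set(facts), True
    while changed:
        changed = False
        for body, head in rules:
            if body in derived and head not in derived:
                derived.add(head)
                changed = True
    query = random.choice(PREDICATES)
    text = ". ".join([f"Alice is {f}" for f in sorted(facts)] +
                     [f"If Alice is {b}, then Alice is {h}" for b, h in rules])
    return {"input": f"{text}. Is Alice {query}?", "label": int(query in derived)}

print(make_example())
```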

* Table 1 & 2 numbers were out-dated in v1; we have updated them; the observations and conclusions remain unchanged 

GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models

May 24, 2022
Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, Kai-Wei Chang

Recent work has shown that Pre-trained Language Models (PLMs) have the ability to store the relational knowledge from pre-training data in their model parameters. However, it is not clear to what extent PLMs store geo-diverse commonsense knowledge, the knowledge associated with a culture and only shared locally. For instance, the color of a bridal dress is white in American weddings, whereas it is red in Chinese weddings. Here, we wish to probe whether PLMs can predict red and white as the color of the bridal dress when queried for American and Chinese weddings, respectively. To this end, we introduce a framework for geo-diverse commonsense probing on multilingual PLMs (mPLMs) and a corresponding benchmark dataset, Geo-diverse Commonsense Multilingual Language Model Analysis (GeoMLAMA). GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard mPLMs, including variants of mBERT, XLM, mT5, and XGLM, on GeoMLAMA. Interestingly, we find that 1) larger mPLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) mPLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may better probe knowledge about a non-native country than its native country.
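
A minimal sketch of this style of probing (the prompt wording and candidate answers are assumptions for illustration): a masked prompt is posed to a multilingual PLM and the scores of candidate fillers are compared.

```python
# Probe a multilingual PLM with a masked prompt and rank candidate answers.
from transformers import pipeline

probe = pipeline("fill-mask", model="bert-base-multilingual-cased")

def rank_candidates(prompt: str, candidates: list[str]):
    """Rank candidate fillers for the [MASK] token by the model's score."""
    results = probe(prompt, targets=candidates)
    return [(r["token_str"], r["score"]) for r in results]

print(rank_candidates(
    "At traditional weddings in China, the bride usually wears a [MASK] dress.",
    ["red", "white"],
))
```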

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Apr 20, 2022
Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, Jianfeng Gao

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets/tasks. However, it remains a challenge to evaluate the transferability of these foundation models due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark to compare and evaluate pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed to significantly improve adaptation performance. (iii) Metrics. A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). We will release our toolkit and evaluation platforms for the research community.
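
A minimal sketch of the linear-probing setting in the parameter-efficiency track (CLIP is used here as a stand-in language-augmented visual model; the dataset handling is omitted): a logistic-regression head is fit on frozen image features.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images):
    """Frozen image features from the pre-trained backbone."""
    inputs = processor(images=images, return_tensors="pt")
    return clip.get_image_features(**inputs).numpy()

def linear_probe(train_images, train_labels, test_images, test_labels) -> float:
    """Fit a linear head on frozen features and report top-1 accuracy."""
    head = LogisticRegression(max_iter=1000).fit(embed(train_images), train_labels)
    return head.score(embed(test_images), test_labels)
```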

* Preprint. The first two authors contributed equally 

RegionCLIP: Region-based Language-Image Pretraining

Dec 16, 2021
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to open-vocabulary object detection tasks, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/RegionCLIP.
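
A minimal sketch of the region-text matching step (an approximation of the idea, not the released code): a pretrained CLIP scores cropped regions against template captions ("a photo of a {concept}") to create pseudo region-text pairs; the concept pool and box format are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
concepts = ["cat", "dog", "sofa", "lamp"]                      # illustrative concept pool
captions = [f"a photo of a {c}" for c in concepts]

@torch.no_grad()
def label_regions(image: Image.Image, boxes: list[tuple[int, int, int, int]]):
    """Assign each region crop the best-matching template caption."""
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=captions, images=crops, return_tensors="pt", padding=True)
    sims = clip(**inputs).logits_per_image                      # (regions, captions)
    best = sims.argmax(dim=-1)
    return [(box, captions[i]) for box, i in zip(boxes, best.tolist())]
```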

* Technical report 