Shih-Fu Chang

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Jul 03, 2023
Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilize its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, but they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
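
As an illustration of the underlying recipe (CLIP-based zero-shot scoring that mixes a global image-text score with fine-grained scores over object regions), here is a minimal sketch, not the authors' UniFine code; the model checkpoint, the question-to-statement template, and the externally supplied region boxes are all assumptions.

```python
# A minimal sketch of CLIP as a zero-shot image-text matcher, augmented with
# fine-grained region scores; not the authors' UniFine implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_answers(image: Image.Image, question: str, answers: list[str],
                  region_boxes: list[tuple[int, int, int, int]], alpha: float = 0.5):
    """Rank candidate answers by a mix of global and region-level CLIP similarity."""
    # Convert each (question, answer) pair into a statement (hypothetical template).
    statements = [f"{question} {a}" for a in answers]
    crops = [image.crop(box) for box in region_boxes]  # object regions, e.g. from a detector
    images = [image] + crops                           # index 0 = global view
    inputs = processor(text=statements, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text         # (num_statements, num_images)
    global_score = sims[:, 0]
    region_score = sims[:, 1:].max(dim=1).values if len(crops) else torch.zeros_like(global_score)
    return (alpha * global_score + (1 - alpha) * region_score).softmax(dim=0)
```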

* 14 pages, 4 figures, ACL 2023 Findings 

Learning from Children: Improving Image-Caption Pretraining via Curriculum

May 30, 2023
Hammad A. Ayyubi, Rahul Lokesh, Alireza Zareian, Bo Wu, Shih-Fu Chang

Image-caption pretraining has been used quite successfully for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem -- it requires multiple concepts (nouns) from captions to be aligned to several objects in images. To tackle this problem, we go to the roots -- the best learners, children. We take inspiration from cognitive science studies of children's language learning to propose a curriculum learning framework. The learning begins with easy-to-align image-caption pairs containing one concept per caption. The difficulty is progressively increased with each new phase by adding one more concept per caption. Correspondingly, the knowledge acquired in each learning phase is utilized in subsequent phases to effectively constrain the learning problem to aligning one new concept-object pair in each phase. We show that this learning strategy improves over vanilla image-caption training in various settings -- pretraining from scratch, using a pretrained image and/or text encoder, low-data regimes, etc.
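
A rough sketch of the curriculum scheduling described above: bucket image-caption pairs by the number of concepts per caption and train phase by phase. The noun-counting heuristic and the names below are placeholders, not the paper's implementation.

```python
# A minimal sketch of concept-count curriculum phases; the concept counter is a
# crude stand-in for proper noun extraction.
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "with", "and", "to"}

def count_concepts(caption: str) -> int:
    # Hypothetical heuristic: treat non-stopword tokens as candidate concepts (nouns).
    return sum(1 for tok in caption.lower().split() if tok not in STOPWORDS)

def curriculum_phases(pairs):
    """Group (image_path, caption) pairs into phases: phase k has k concepts per caption."""
    buckets = defaultdict(list)
    for image_path, caption in pairs:
        buckets[count_concepts(caption)].append((image_path, caption))
    for k in sorted(buckets):
        yield k, buckets[k]

# Usage: train one phase at a time, warm-starting each phase from the previous one.
pairs = [("img1.jpg", "a dog"), ("img2.jpg", "a dog on a sofa")]
for phase, subset in curriculum_phases(pairs):
    print(phase, subset)  # e.g. feed `subset` to the contrastive trainer for this phase
```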

* ACL Findings 2023 

Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs

May 29, 2023
Mingyang Zhou, Yi R. Fung, Long Chen, Christopher Thomas, Heng Ji, Shih-Fu Chang

Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community. The capability to uncover the underlying table data of chart figures is critical to automatic chart understanding. We introduce ChartT5, a V+L model that learns how to interpret table information from chart images via cross-modal pre-training on plot table pairs. Specifically, we propose two novel pre-training objectives, Masked Header Prediction (MHP) and Masked Value Prediction (MVP), to equip the model with different skills for interpreting table information. We have conducted extensive experiments on chart question answering and chart summarization to verify the effectiveness of the proposed pre-training strategies. In particular, on the ChartQA benchmark, our ChartT5 outperforms the state-of-the-art non-pretraining methods by over 8%.
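
To make the two objectives concrete, here is a hedged sketch of T5-style masking over a flattened plot table; the sentinel tokens follow the standard T5 convention, while the table serialization format and masking rate are assumptions rather than ChartT5's exact scheme.

```python
# A minimal sketch of masked header / masked value prediction targets over a flattened
# table, using T5 sentinel tokens; not ChartT5's exact pre-processing.
import random

def mask_table(headers, rows, target_kind="header", mask_prob=0.3):
    """Build a (source, target) pair, masking only headers or only cell values."""
    cells = [("header", h) for h in headers] + [("value", v) for row in rows for v in row]
    source, target, sentinel = [], [], 0
    for kind, cell in cells:
        if kind == target_kind and random.random() < mask_prob:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}> {cell}")
            sentinel += 1
        else:
            source.append(str(cell))
    return " | ".join(source), " ".join(target)

# Masked Header Prediction on a tiny plot table (which headers get masked varies per call).
src, tgt = mask_table(["year", "sales"], [["2020", "1.2"], ["2021", "1.8"]], target_kind="header")
print(src)  # e.g. "<extra_id_0> | sales | 2020 | 1.2 | 2021 | 1.8"
print(tgt)  # e.g. "<extra_id_0> year"
```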

* Accepted by Findings of ACL 2023 

Non-Sequential Graph Script Induction via Multimedia Grounding

May 27, 2023
Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal, Heng Ji

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility people display when executing tasks in real life. For example, in the CrossTask dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose the new and challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to WikiHow textual steps and thus transform each video into an observed step path on the latent ground-truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learned tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute on F1@3 for next-step prediction and 13.8% absolute on Acc@1 for partial sequence completion. Human evaluation shows that our model outperforms the WikiHow linear baseline by 48.76% absolute in capturing sequential and non-sequential step relationships.
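
A minimal sketch of the induction step described above, under the assumption that each grounded video has already been converted into a step path: count step transitions across paths, keep frequent next steps as branches, and mark pairs observed in both orders as interchangeable. The threshold and path format are illustrative only.

```python
# A toy sketch of turning observed step paths into a graph script with branching and
# interchangeable steps; not the paper's model.
from collections import Counter, defaultdict

def induce_graph_script(step_paths, min_count=2):
    """step_paths: list of step-name sequences, one per grounded video."""
    edge_counts = Counter()
    for path in step_paths:
        edge_counts.update(zip(path, path[1:]))
    graph = defaultdict(set)
    interchangeable = set()
    for (a, b), n in edge_counts.items():
        if n < min_count:
            continue
        graph[a].add(b)                       # frequent next step -> "branching"
        if edge_counts[(b, a)] >= min_count:  # seen in both orders -> interchangeable
            interchangeable.add(frozenset((a, b)))
    return graph, interchangeable

paths = [["boil water", "add pasta", "add salt"],
         ["boil water", "add salt", "add pasta"],
         ["boil water", "add pasta", "add salt"],
         ["boil water", "add salt", "add pasta"]]
graph, swaps = induce_graph_script(paths)
print(dict(graph))   # "boil water" branches into two possible next steps
print(swaps)         # "add pasta" / "add salt" observed in both orders
```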


IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

May 24, 2023
Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inference. To tackle such tasks, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason towards the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
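
The iterative loop itself can be summarized in a short skeleton. The two model calls are injected as callables with hypothetical signatures, and the prompts, round limit, and UNSURE convention are assumptions, not IdealGPT's actual prompts.

```python
# A skeleton of the iterative divide-and-conquer loop described above;
# ask_llm(prompt) -> str and ask_vlm(image, question) -> str are placeholders.
def iterative_vl_reasoning(question, image, ask_llm, ask_vlm, max_rounds=3):
    qa_history = []
    for _ in range(max_rounds):
        # 1) An LLM decomposes the main question into sub-questions, given what is known so far.
        subqs = ask_llm(
            f"Main question: {question}\nKnown sub-QA pairs: {qa_history}\n"
            "List new sub-questions, one per line."
        ).splitlines()
        # 2) A VLM answers each sub-question about the image.
        qa_history += [(sq, ask_vlm(image, sq)) for sq in subqs if sq.strip()]
        # 3) A reasoner LLM tries to answer; it replies UNSURE if the evidence is insufficient.
        verdict = ask_llm(
            f"Main question: {question}\nSub-QA pairs: {qa_history}\n"
            "Answer the main question, or reply UNSURE if the evidence is insufficient."
        )
        if "UNSURE" not in verdict.upper():
            return verdict
    return verdict  # best effort after the final round
```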

* 13 pages, 5 figures 

Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Apr 07, 2023
Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question-synthesis methods pre-train question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, QG models only learn to ask association questions (e.g., "what is someone doing...") and yield inferior performance due to the poor transfer of association knowledge to CVidQA, which focuses on causal questions like "why is someone doing ...". Observing this, we propose to exploit causal knowledge to generate question-answer pairs, and introduce a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), which leverages causal commonsense knowledge from language models to tackle CVidQA. To extract knowledge from LMs, CaKE-LM generates causal questions containing two events with one triggering the other (e.g., "score a goal" triggers "soccer player kicking ball") by prompting the LM with the action (soccer player kicking ball) to retrieve the intention (to score a goal). CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets. We also conduct comprehensive analyses and provide key findings for future research.
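
A toy sketch of the extraction step: prompt a language model with an observed action to elicit the intention, then package the pair as a causal "why" question and answer. The prompt template and the `ask_lm` placeholder are assumptions, not CaKE-LM's actual prompts.

```python
# A minimal sketch of extracting causal knowledge from an LM and turning it into a
# "why" question-answer pair; `ask_lm` is a placeholder for any text-completion model.
def make_causal_qa(action: str, ask_lm) -> tuple[str, str]:
    """Given an observed action, ask an LM for the triggering intention and build a QA pair."""
    intention = ask_lm(
        f"A person is {action}. The most likely reason they are doing this is to"
    ).strip().rstrip(".")
    question = f"Why is the person {action}?"
    answer = f"To {intention}."
    return question, answer

# Toy usage with a canned LM response, so the example runs without any model.
fake_lm = lambda prompt: " score a goal"
print(make_causal_qa("kicking a ball", fake_lm))
# -> ("Why is the person kicking a ball?", "To score a goal.")
```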

* CVPR 2023 Workshop L3D-IVU 

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Mar 29, 2023
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding-box supervision. This work addresses the task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained only on loose video and subtitle supervision, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations, and incorporate both in a joint approach. To evaluate this challenging task in a real-life setting, we propose a new benchmark dataset providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.
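
As a rough illustration of fusing local and global similarities for grounding (not the paper's model), the sketch below scores a narration embedding against per-frame patch features and clip-level features and returns the best frame/patch; the shapes and fusion weight are assumptions.

```python
# A small sketch of combining local (per-frame patch) and global (clip-level) similarity
# for grounding one narration sentence; feature shapes are placeholders.
import numpy as np

def ground_sentence(text_emb, patch_embs, clip_embs, alpha=0.5):
    """
    text_emb:   (d,)       embedding of one narration sentence
    patch_embs: (T, P, d)  per-frame spatial patch embeddings
    clip_embs:  (T, d)     per-frame global embeddings
    Returns the (frame, patch) index with the highest fused score.
    """
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    t, p, c = norm(text_emb), norm(patch_embs), norm(clip_embs)
    local = p @ t               # (T, P) fine-grained spatial scores
    global_ = c @ t             # (T,)   clip-level temporal scores
    fused = alpha * local + (1 - alpha) * global_[:, None]
    return np.unravel_index(np.argmax(fused), fused.shape)

print(ground_sentence(np.random.rand(8), np.random.rand(10, 4, 8), np.random.rand(10, 8)))
# -> (best_frame, best_patch)
```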


Supervised Masked Knowledge Distillation for Few-Shot Transformers

Mar 29, 2023
Han Lin, Guangxing Han, Jiawei Ma, Shiyuan Huang, Xudong Lin, Shih-Fu Chang

Vision Transformers (ViTs) have achieved impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled examples, ViTs tend to overfit and suffer from severe performance degradation due to the absence of CNN-like inductive biases. Previous works in FSL avoid this problem either with the help of self-supervised auxiliary losses or through the dexterous use of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch-token reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method, despite its simple design, outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD.
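
For intuition, here is a compact sketch of an intra-class self-distillation loss on [cls] tokens in the spirit of the objective above; the DINO-style temperature softmax and the masked-student/clean-teacher split are assumptions about the general recipe, not the paper's exact losses.

```python
# A sketch of intra-class [cls]-token distillation: the student (on masked views) matches
# the teacher's soft distribution on *other* images of the same class; not SMKD's exact loss.
import torch
import torch.nn.functional as F

def intra_class_distill_loss(student_cls, teacher_cls, labels, tau_s=0.1, tau_t=0.04):
    """
    student_cls: (B, K) projected [cls] outputs of the student on masked views
    teacher_cls: (B, K) projected [cls] outputs of the EMA teacher on clean views
    labels:      (B,)   class labels
    """
    t_probs = F.softmax(teacher_cls.detach() / tau_t, dim=-1)          # (B, K)
    s_logp = F.log_softmax(student_cls / tau_s, dim=-1)                # (B, K)
    same_class = labels[:, None] == labels[None, :]                    # (B, B)
    pair_mask = same_class & ~torch.eye(len(labels), dtype=torch.bool) # exclude self-pairs
    ce = -(t_probs[None, :, :] * s_logp[:, None, :]).sum(-1)           # (student i, teacher j)
    return (ce * pair_mask).sum() / pair_mask.sum().clamp(min=1)

loss = intra_class_distill_loss(torch.randn(4, 16), torch.randn(4, 16),
                                torch.tensor([0, 0, 1, 1]))
print(loss)
```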

* To appear in CVPR 2023 

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

Mar 16, 2023
Jiawei Ma, Yulei Niu, Jincheng Xu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

Generalized few-shot object detection aims to achieve precise detection both on base classes with abundant annotations and on novel classes with limited training data. Existing approaches enhance few-shot generalization at the sacrifice of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out that the reason is insufficient Discriminative feature learning for all of the classes. As such, we propose a new training framework, DiGeo, to learn Geometry-aware features with inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins in the classification loss and encourage the features to stay close to the class centers. Experimental studies on two few-shot benchmark datasets (VOC, COCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes.
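
The ETF classifier can be constructed offline in a few lines. The sketch below uses the standard simplex-ETF construction from the neural-collapse literature, so pairwise cosine similarities between class centers equal -1/(K-1); the dimensions are arbitrary, and the surrounding detection training loop is omitted.

```python
# A short sketch of building fixed, maximally and equally separated class-center weights
# via a simplex equiangular tight frame (ETF); not the DiGeo training code.
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Return a (feat_dim, num_classes) weight matrix whose columns form a simplex ETF."""
    assert feat_dim >= num_classes, "need feat_dim >= num_classes for this construction"
    rng = np.random.default_rng(seed)
    # Random matrix -> orthonormal columns via QR.
    p, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    k = num_classes
    center = np.eye(k) - np.ones((k, k)) / k
    w = np.sqrt(k / (k - 1)) * p @ center
    return w / np.linalg.norm(w, axis=0, keepdims=True)  # unit-norm class centers

W = simplex_etf(num_classes=5, feat_dim=64)
print(np.round(W.T @ W, 3))  # off-diagonal entries are all -1/(K-1) = -0.25
```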

* CVPR 2023 Camera Ready (Supp Attached). Code Link: https://github.com/Phoenix-V/DiGeo 

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Jan 06, 2023
Andrew Lu, Xudong Lin, Yulei Niu, Shih-Fu Chang

Understanding event relationships in videos requires a model to understand the underlying structures of events (i.e., the event type, the associated argument roles, and the corresponding entities) along with the factual knowledge needed for reasoning. Structural symbolic representation (SSR) based methods directly take event types and associated argument roles/entities as inputs to perform reasoning. However, the state-of-the-art video event-relation prediction system shows the necessity of using continuous feature vectors from input videos; existing methods based solely on SSR inputs fail completely, even when given oracle event types and argument roles. In this paper, we conduct an extensive empirical analysis to answer the following questions: 1) why SSR-based methods failed; 2) how to properly understand the evaluation setting of video event-relation prediction; 3) how to uncover the potential of SSR-based methods. We first identify that the failure of previous SSR-based video event-relation prediction models is caused by sub-optimal training settings. Surprisingly, we find that a simple SSR-based model with tuned hyperparameters can actually yield a 20% absolute improvement in macro-accuracy over the state-of-the-art model. Then, through qualitative and quantitative analysis, we show why evaluation that takes only video as input is currently infeasible, and why oracle event information is needed to obtain an accurate evaluation. Based on these findings, we propose to further contextualize the SSR-based model into an Event-Sequence Model and equip it with more factual knowledge through a simple yet effective reformulation of external visual commonsense knowledge bases into an event-relation prediction pretraining dataset. The resulting model establishes a new state of the art with a 25% macro-accuracy performance boost.
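
As a sketch of the reformulation idea (turning knowledge-base event pairs into event-relation pretraining examples with a flattened SSR input), consider the following; the serialization format and label set are assumptions, not the paper's exact scheme.

```python
# A minimal sketch of reformulating (event, relation, event) triples from an external
# commonsense knowledge base into text/label pairs for event-relation pretraining.
def serialize_ssr(event_type: str, args: dict) -> str:
    """Flatten an event's type and argument-role/entity pairs into one symbolic string."""
    arg_str = " ".join(f"[{role}] {entity}" for role, entity in args.items())
    return f"[EVENT] {event_type} {arg_str}"

def kb_to_pretraining_examples(triples):
    """triples: ((type1, args1), relation, (type2, args2)) tuples -> (input_text, label) pairs."""
    examples = []
    for (type1, args1), relation, (type2, args2) in triples:
        text = f"{serialize_ssr(type1, args1)} [SEP] {serialize_ssr(type2, args2)}"
        examples.append((text, relation))
    return examples

kb = [(("Attack", {"Attacker": "soldier", "Target": "village"}), "BEFORE",
       ("Evacuate", {"Agent": "residents"}))]
print(kb_to_pretraining_examples(kb))
```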

* Work in progress 