Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shih-Fu Chang

Columbia University

ENTER: Event Based Interpretable Reasoning for VideoQA

Jan 24, 2025

Hammad Ayyubi, Junzhang Liu, Ali Asgarov, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Zhecan Wang, Chia-Wei Tang, Hani Alomari, Md. Atabuzzaman, Xudong Lin(+3 more)

Figure 1 for ENTER: Event Based Interpretable Reasoning for VideoQA

Figure 2 for ENTER: Event Based Interpretable Reasoning for VideoQA

Figure 3 for ENTER: Event Based Interpretable Reasoning for VideoQA

Figure 4 for ENTER: Event Based Interpretable Reasoning for VideoQA

Abstract:In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers many benefits: 1) Interpretable VideoQA via generated code that parses event-graph; 2) Incorporation of contextual visual information in the reasoning process (code generation) via event graphs; 3) Robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information in the reasoning plan generation, and are brittle. While bottom-up approaches produce responses from visual data, they lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that not only does our method outperform existing top-down approaches while obtaining competitive performance against bottom-up approaches, but more importantly, offers superior interpretability and explainability in the reasoning process.

Via

Access Paper or Ask Questions

PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

Jan 24, 2025

Hammad Ayyubi, Xuande Feng, Junzhang Liu, Xudong Lin, Zhecan Wang, Shih-Fu Chang

Figure 1 for PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

Figure 2 for PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

Figure 3 for PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

Figure 4 for PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

Abstract:The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can't be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets -- TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.

* NAACL 2025 Findings

Via

Access Paper or Ask Questions

WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

May 28, 2024

Jiawei Ma, Yulei Niu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

Figure 1 for WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Figure 2 for WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Figure 3 for WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Figure 4 for WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Abstract:Language has been useful in extending the vision encoder to data from diverse distributions without empirical discovery in training domains. However, as the image description is mostly at coarse-grained level and ignores visual details, the resulted embeddings are still ineffective in overcoming complexity of domains at inference time. We present a self-supervision framework WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation, by only leveraging data in a single domain and without any test prior. Specifically, for each image, we first estimate the language embedding with fine-grained alignment, which can be consequently used to adaptively identify and then remove domain-specific counterpart from the raw visual embedding. WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT. Experimental studies on three domain generalization datasets demonstrate the effectiveness of our approach.

Via

Access Paper or Ask Questions

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

May 23, 2024

Junzhang Liu, Zhecan Wang, Hammad Ayyubi, Haoxuan You, Chris Thomas, Rui Sun, Shih-Fu Chang, Kai-Wei Chang

Figure 1 for Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Figure 2 for Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Figure 3 for Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Figure 4 for Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Abstract:Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context. Training models on such data foster biased learning and hallucinations as models tend to make similar unwarranted assumptions. To address this issue, we collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. Strong improvements across multiple benchmarks demonstrate the effectiveness of our approach. Further, we develop a general-purpose Context-AwaRe Abstention (CARA) detector to identify samples lacking sufficient context and enhance model accuracy by abstaining from responding if the required context is absent. CARA exhibits generalization to new benchmarks it wasn't trained on, underscoring its utility for future VLU benchmarks in detecting or cleaning samples with inadequate context. Finally, we curate a Context Ambiguity and Sufficiency Evaluation (CASE) set to benchmark the performance of insufficient context detectors. Overall, our work represents a significant advancement in ensuring that vision-language models generate trustworthy and evidence-based outputs in complex real-world scenarios.

Via

Access Paper or Ask Questions

MoDE: CLIP Data Experts via Clustering

Apr 24, 2024

Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu

Figure 1 for MoDE: CLIP Data Experts via Clustering

Figure 2 for MoDE: CLIP Data Experts via Clustering

Figure 3 for MoDE: CLIP Data Experts via Clustering

Figure 4 for MoDE: CLIP Data Experts via Clustering

Abstract:The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less ($<$35\%) training cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.

* IEEE CVPR 2024 Camera Ready. Code Link: https://github.com/facebookresearch/MetaCLIP/tree/main/mode

Via

Access Paper or Ask Questions

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Apr 11, 2024

Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan(+1 more)

Figure 1 for Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Figure 2 for Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Figure 3 for Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Figure 4 for Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Abstract:While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the pre-trained fixed visual encoder and failed to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating the additional DINOv2 encoder, the model learns better and diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.

* Preprint. 14 pages, 4 figures

Via

Access Paper or Ask Questions

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

Mar 25, 2024

Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji

Abstract:Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks and their evaluation metrics and sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.

Via

Access Paper or Ask Questions

SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Mar 03, 2024

Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang

Figure 1 for SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Figure 2 for SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Figure 3 for SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Figure 4 for SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Abstract:We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leveraged the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.

* Accepted by ICLR 2024

Via

Access Paper or Ask Questions

Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Dec 15, 2023

Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan, Yi R. Fung, Zhenhailong Wang, Lingyu Zhang, Shih-Fu Chang, Heng Ji

Figure 1 for Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Figure 2 for Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Figure 3 for Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Figure 4 for Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Abstract:Recent advancements in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content and thus enhancing various applications. One issue with these powerful models is that they sometimes produce texts that are factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has not received as much scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a model for visual entailment that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

Via

Access Paper or Ask Questions

Video Summarization: Towards Entity-Aware Captions

Dec 01, 2023

Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

Figure 1 for Video Summarization: Towards Entity-Aware Captions

Figure 2 for Video Summarization: Towards Entity-Aware Captions

Figure 3 for Video Summarization: Towards Entity-Aware Captions

Figure 4 for Video Summarization: Towards Entity-Aware Captions

Abstract:Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

Via

Access Paper or Ask Questions