Mohit Bansal

Shammie

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

Sep 13, 2022

Adyasha Maharana, Darryl Hannan, Mohit Bansal

Abstract: Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or 'retro-fit' the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. Next, we explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach StoryDALL-E on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset DiDeMoSV collected from a video-captioning dataset. We also develop a model StoryGANc based on Generative Adversarial Networks (GAN) for story continuation, and compare it with the StoryDALL-E model to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.

* ECCV 2022 (33 pages; code, data, demo, model card available at https://github.com/adymaharana/storydalle)

Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization

Sep 08, 2022

Shiyue Zhang, David Wan, Mohit Bansal

Abstract: The problems of unfaithful summaries have been widely discussed in the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? It turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries: incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, and other misleading information. We ask humans to label these problems in 1500 English summaries produced by 15 diverse extractive systems. We find that 33% of the summaries have at least one of the five issues. For automatic detection of these problems, we find that 5 existing faithfulness evaluation metrics for summarization correlate poorly with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed for detecting unfaithful extractive summaries and is shown to have the best performance. We hope our work can increase the awareness of unfaithfulness problems in extractive summarization and help future work to evaluate and resolve these issues. Our data and code are publicly available at https://github.com/ZhangShiyue/extractive_is_not_faithful

* 19 pages

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Jul 25, 2022

Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

Abstract: While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player has to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, aiming to allow future data collection that can be used to develop models with better association abilities.

* Website: https://winogavil.github.io/
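The WinoGAViL abstract reports scores as a Jaccard index. As a minimal sketch of that measure (intersection over union of the selected and the gold candidates), the snippet below uses hypothetical image identifiers; it is not taken from the benchmark's released code, and the handling of two empty selections is an assumed convention.

```python
def jaccard_index(predicted, gold):
    """Intersection-over-union of two collections, treated as sets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        # Assumed convention: two empty selections count as a perfect match.
        return 1.0
    return len(predicted & gold) / len(union)


# Hypothetical round: the cue has two gold images; the player picks one
# correct image and one incorrect image.
gold = {"img_moon", "img_wolf"}
predicted = {"img_moon", "img_dog"}

# One shared item out of three distinct items overall.
score = jaccard_index(predicted, gold)
print(f"Jaccard index: {score:.3f}")
```

A perfect selection scores 1.0, and a selection with no overlap scores 0.0, so partial credit grows with the share of correctly identified candidates.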

CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination

Jul 08, 2022

Hyounghun Kim, Abhay Zala, Mohit Bansal

Abstract: As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being obscured by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging counterfactual scene imagination task. Our code and dataset are publicly available at: https://github.com/hyounghk/CoSIm

* NAACL 2022 (13 pages)

SETSum: Summarization and Visualization of Student Evaluations of Teaching

Jul 08, 2022

Yinuo Hu, Shiyue Zhang, Viji Sathy, A. T. Panter, Mohit Bansal

Abstract: Student Evaluations of Teaching (SETs) are widely used in colleges and universities. Typically, SET results are summarized for instructors in a static PDF report. The report often includes summary statistics for quantitative ratings and an unsorted list of open-ended student comments. The lack of organization and summarization of the raw comments hinders those interpreting the reports from fully utilizing informative feedback, making accurate inferences, and designing appropriate instructional improvements. In this work, we introduce a novel system, SETSum, that leverages sentiment analysis, aspect extraction, summarization, and visualization techniques to provide organized illustrations of SET findings to instructors and other reviewers. Ten university professors from diverse departments serve as evaluators of the system and all agree that SETSum helps them interpret SET results more efficiently, and 6 out of 10 instructors prefer our system over the standard static PDF report (while the remaining 4 would like to have both). This demonstrates that our work holds the potential to reform the SET reporting conventions in the future. Our code is available at https://github.com/evahuyn/SETSum

* NAACL 2022 Demo (20 pages)

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

Jul 05, 2022

Jialu Li, Hao Tan, Mohit Bansal

Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment-agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR

* NAACL 2022 Findings (18 pages)

Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?

Jun 30, 2022

Xiang Zhou, Shiyue Zhang, Mohit Bansal

Abstract: Previous Part-Of-Speech (POS) induction models usually make certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset and the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model's ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges. Our code is available at https://github.com/owenzx/MPoSM

* NAACL 2022 (16 pages)

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

Jun 22, 2022

Zhuofan Ying, Peter Hase, Mohit Bansal

Abstract: Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model's internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model's in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning. All supporting code is available at https://github.com/zfying/visfis

* 24 pages, 10 figures (First two authors contributed equally)
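The Uncertainty objective in the VisFIS abstract asks for max-entropy predictions when no important information is available. The sketch below illustrates one way to quantify how far a prediction is from maximum entropy, namely the gap log K - H(p), which equals the KL divergence from the predicted distribution to the uniform distribution over K answers. This is an illustrative quantity only; the function names are hypothetical and it is not claimed to be the loss used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_gap(logits):
    """log(K) - H(p): zero for a uniform prediction, positive otherwise."""
    probs = softmax(logits)
    return math.log(len(probs)) - entropy(probs)


# Equal logits give a uniform prediction, so the gap is (numerically) zero.
print(uncertainty_gap([0.0, 0.0, 0.0, 0.0]))

# A confident prediction is far from maximum entropy, so the gap is larger.
print(uncertainty_gap([5.0, 0.0, 0.0, 0.0]))
```

Minimizing such a gap on inputs with the important features removed would push the model toward uniform predictions, which is the behavior the Uncertainty objective describes.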

Enhanced Knowledge Selection for Grounded Dialogues via Document Semantic Graphs

Jun 15, 2022

Sha Li, Madhi Namazifar, Di Jin, Mohit Bansal, Heng Ji, Yang Liu, Dilek Hakkani-Tur

Abstract: Providing conversation models with background knowledge has been shown to make open-domain dialogues more informative and engaging. Existing models treat knowledge selection as a sentence ranking or classification problem where each sentence is handled individually, ignoring the internal semantic connection among sentences in the background document. In this work, we propose to automatically convert the background knowledge documents into document semantic graphs and then perform knowledge selection over such graphs. Our document semantic graphs preserve sentence-level information through the use of sentence nodes and provide concept connections between sentences. We jointly apply multi-task learning for sentence-level and concept-level knowledge selection and show that it improves sentence-level selection. Our experiments show that our semantic graph-based knowledge selection improves over sentence selection baselines for both the knowledge selection task and the end-to-end response generation task on HollE and improves generalization on unseen topics in WoW.

* NAACL 2022. Please refer to https://www.amazon.science/publications/enhanced-knowledge-selection-for-grounded-dialogues-via-document-semantic-graphs for code and resources

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Jun 13, 2022

Yi-Lin Sung, Jaemin Cho, Mohit Bansal

Abstract: Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g., only 2% of the parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from backbone networks and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but instead only through the side network and ladder connections. We evaluate our method with various models (T5, CLIP-T5) on both NLP (GLUE) and vision-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% with similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The exact same trend also holds in our experiments on VL tasks.

* 13 pages; our code is available at: https://github.com/ylsung/Ladder-Side-Tuning
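The "2.7x more memory savings" figure in the LST abstract follows from the two reported percentages. The short check below reproduces that arithmetic; the variable names are illustrative, and only the 69% and 26% values come from the abstract.

```python
# Fractions of full fine-tuning memory saved, as reported in the abstract.
lst_saved = 0.69     # Ladder Side-Tuning
other_saved = 0.26   # other PETL methods at similar parameter usage

# Ratio of savings: how many times more memory LST saves.
savings_ratio = lst_saved / other_saved
print(f"LST saves {savings_ratio:.1f}x as much memory")

# Equivalent view: fraction of full fine-tuning memory still required.
lst_remaining = 1.0 - lst_saved
other_remaining = 1.0 - other_saved
print(f"Remaining memory: LST {lst_remaining:.0%}, others {other_remaining:.0%}")
```

Note that the ratio compares memory saved, not memory used: in terms of what remains, LST needs roughly 31% of full fine-tuning memory against roughly 74% for the other methods.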
