Jiebo Luo

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

May 31, 2023

Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo

Abstract: To increase the generalization capability of VQA systems, many recent studies have tried to de-bias spurious language or vision associations that shortcut the question or image to the answer. Despite these efforts, the literature fails to address the confounding effect of vision and language simultaneously. As a result, when these methods reduce bias learned from one modality, they usually increase bias from the other. In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect. The model trained with this strategy can concurrently and efficiently reduce vision and language bias. To the best of our knowledge, this is the first work to reduce biases resulting from confounding effects of vision and language in VQA, leveraging causal explain-away relations. We accompany our method with an explain-away strategy that improves accuracy on questions with numerical answers, which has been an open problem for existing methods. The proposed method outperforms the state-of-the-art methods on the VQA-CP v2 dataset.

* 22 pages

Learning to Evaluate the Artness of AI-generated Images

May 08, 2023

Junyu Chen, Jie An, Hanjia Lyu, Jiebo Luo

Abstract: Assessing the artness of AI-generated images continues to be a challenge within the realm of image generation. Most existing metrics cannot be used to perform instance-level and reference-free artness evaluation. This paper presents ArtScore, a metric designed to evaluate the degree to which an image resembles authentic artworks by artists (or conversely photographs), thereby offering a novel approach to artness assessment. We first blend pre-trained models for photo and artwork generation, resulting in a series of mixed models. Subsequently, we utilize these mixed models to generate images exhibiting varying degrees of artness with pseudo-annotations. Each photorealistic image has a corresponding artistic counterpart and a series of interpolated images that range from realistic to artistic. This dataset is then employed to train a neural network that learns to estimate quantized artness levels of arbitrary images. Extensive experiments reveal that the artness levels predicted by ArtScore align more closely with human artistic evaluation than existing evaluation metrics, such as Gram loss and ArtFID.
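
The abstract does not say how the photo and artwork generators are blended. As one possible illustration only, the sketch below assumes the two models share an architecture and blends them by linear interpolation of their parameters. The function names, the dictionary-of-lists parameter layout, and the evenly spaced blend coefficients are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical layout: layer name -> flat weight list


def blend_weights(photo: Params, art: Params, alpha: float) -> Params:
    """Linearly interpolate two parameter sets (assumed blending rule).

    alpha = 0.0 returns the photo model, alpha = 1.0 the artwork model.
    """
    if photo.keys() != art.keys():
        raise ValueError("both models must have the same layers")
    return {
        name: [(1.0 - alpha) * p + alpha * a for p, a in zip(photo[name], art[name])]
        for name in photo
    }


def blend_series(photo: Params, art: Params, steps: int) -> List[Params]:
    """Build `steps` mixed models, evenly spaced from photo to artwork."""
    if steps < 2:
        raise ValueError("need at least two steps (the two endpoints)")
    return [blend_weights(photo, art, i / (steps - 1)) for i in range(steps)]
```

Under this assumption, the index of a mixed model in the series could serve as the pseudo-annotation for the images it generates, which matches the "quantized artness levels" wording of the abstract in spirit.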

Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Apr 18, 2023

Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

Abstract: We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. The latter is often limited to first generating a low-resolution video followed by a sequence of frame interpolation and super-resolution models, which makes the entire pipeline very complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work proposes to add additional modules like 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels forward and backward along the temporal dimension. The shifted features of the current frame thus receive the features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can generate images despite being finetuned for T2V generation.
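
The channel shift described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: each frame is reduced to a flat list of per-channel values (a scalar stands in for a whole spatial map), the number of shifted channels `fold` is a hypothetical parameter, and zero-filling at the first and last frame is an assumption.

```python
from typing import List, Sequence


def temporal_shift(frames: Sequence[Sequence[float]], fold: int) -> List[List[float]]:
    """Shift two channel groups along time; leave the rest untouched.

    frames[t][c] is the feature of channel c at frame t.
    Channels [0, fold) take their value from the previous frame.
    Channels [fold, 2 * fold) take their value from the next frame.
    Missing neighbours at the clip boundaries are filled with 0.0 (assumed).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    if fold < 0 or 2 * fold > num_channels:
        raise ValueError("2 * fold must not exceed the channel count")

    shifted = []
    for t in range(num_frames):
        row = list(frames[t])  # start from the unshifted channels
        for c in range(fold):
            row[c] = frames[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):
            row[c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
        shifted.append(row)
    return shifted
```

The operation has no learnable weights: it only re-indexes existing features, so a spatial layer applied afterwards sees information from neighbouring frames without any change to its own parameters.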

* https://latent-shift.github.io

Meta-causal Learning for Single Domain Generalization

Apr 07, 2023

Jin Chen, Zhi Gao, Xinxiao Wu, Jiebo Luo

Abstract: Single domain generalization aims to learn a model from a single training domain (source domain) and apply it to multiple unseen test domains (target domains). Existing methods focus on expanding the distribution of the training domain to cover the target domains, but without estimating the domain shift between the source and target domains. In this paper, we propose a new learning paradigm, namely simulate-analyze-reduce, which first simulates the domain shift by building an auxiliary domain as the target domain, then learns to analyze the causes of domain shift, and finally learns to reduce the domain shift for model adaptation. Under this paradigm, we propose a meta-causal learning method to learn meta-knowledge, that is, how to infer the causes of domain shift between the auxiliary and source domains during training. We use the meta-knowledge to analyze the shift between the target and source domains during testing. Specifically, we perform multiple transformations on source data to generate the auxiliary domain, perform counterfactual inference to learn to discover the causal factors of the shift between the auxiliary and source domains, and incorporate the inferred causality into factor-aware domain alignments. Extensive experiments on several benchmarks of image classification show the effectiveness of our method.

* Accepted by CVPR 2023

Predicting Adverse Neonatal Outcomes for Preterm Neonates with Multi-Task Learning

Mar 28, 2023

Jingyang Lin, Junyu Chen, Hanjia Lyu, Igor Khodak, Divya Chhabra, Colby L Day Richardson, Irina Prelipcean, Andrew M Dylag, Jiebo Luo

Abstract: Diagnosis of adverse neonatal outcomes is crucial for preterm survival since it enables doctors to provide timely treatment. Machine learning (ML) algorithms have been demonstrated to be effective in predicting adverse neonatal outcomes. However, most previous ML-based methods have only focused on predicting a single outcome, ignoring the potential correlations between different outcomes, and potentially leading to suboptimal results and overfitting issues. In this work, we first analyze the correlations between three adverse neonatal outcomes and then formulate the diagnosis of multiple neonatal outcomes as a multi-task learning (MTL) problem. We then propose an MTL framework to jointly predict multiple adverse neonatal outcomes. In particular, the MTL framework contains shared hidden layers and multiple task-specific branches. Extensive experiments have been conducted using Electronic Health Records (EHRs) from 121 preterm neonates. Empirical results demonstrate the effectiveness of the MTL framework. Furthermore, the feature importance is analyzed for each neonatal outcome, providing insights into model interpretability.
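
The "shared hidden layers plus task-specific branches" layout can be illustrated with a small forward pass in plain Python. This is a sketch under assumptions, not the paper's model: a single shared ReLU layer, one sigmoid output per task, and the placeholder task names used in the usage note are all hypothetical, since the abstract gives no layer sizes or outcome names.

```python
import math
from typing import Dict, List, Tuple

Head = Tuple[List[float], float]  # (weights over the shared hidden units, bias)


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mtl_forward(
    features: List[float],
    shared_w: List[List[float]],
    shared_b: List[float],
    heads: Dict[str, Head],
) -> Dict[str, float]:
    """Compute one probability per task from a shared representation."""
    # Shared trunk: every task reads the same hidden representation.
    hidden = relu(dense(features, shared_w, shared_b))
    # Task-specific branches: each task has its own weights on top of `hidden`.
    return {
        name: sigmoid(sum(w * h for w, h in zip(head_w, hidden)) + head_b)
        for name, (head_w, head_b) in heads.items()
    }
```

Calling `mtl_forward` with three entries in `heads` (for example the placeholder keys `"outcome_a"`, `"outcome_b"`, `"outcome_c"`) returns three probabilities that all depend on the same shared weights, which is how correlated outcomes can inform one another during training.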

Bias or Diversity? Unraveling Semantic Discrepancy in U.S. News Headlines

Mar 28, 2023

Jinsheng Pan, Weihong Qi, Zichen Wang, Hanjia Lyu, Jiebo Luo

Abstract: There is a broad consensus that news media outlets incorporate ideological biases in their news articles. However, prior studies on measuring the discrepancies among media outlets and further dissecting the origins of semantic differences suffer from small sample sizes and limited scope. In this study, we collect a large dataset of 1.8 million news headlines from major U.S. media outlets spanning from 2014 to 2022 to thoroughly track and dissect the semantic discrepancy in U.S. news media. We employ multiple correspondence analysis (MCA) to quantify the semantic discrepancy relating to four prominent topics - domestic politics, economic issues, social issues, and foreign affairs. Additionally, we compare the most frequent n-grams in media headlines to provide further qualitative insights into our analysis. Our findings indicate that on domestic politics and social issues, the discrepancy can be attributed to a certain degree of media bias. Meanwhile, the discrepancy in reporting foreign affairs is largely attributed to the diversity in individual journalistic styles. Finally, U.S. media outlets show consistency and high similarity in their coverage of economic issues.
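
The n-gram comparison mentioned in the abstract can be approximated with a simple frequency count. The sketch below is illustrative only: lowercasing and whitespace tokenization are assumptions (the abstract does not describe preprocessing), and the function name and parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_ngrams(headlines: Iterable[str], n: int = 2, k: int = 10) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent word n-grams across a set of headlines."""
    counts: Counter = Counter()
    for headline in headlines:
        tokens = headline.lower().split()  # naive tokenization (assumed)
        # Slide a window of n tokens over the headline.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)
```

Running the function once per outlet and comparing the resulting lists side by side gives the kind of qualitative view the abstract refers to; a real analysis would likely also strip punctuation and stop words.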

Human Behavior in the Time of COVID-19: Learning from Big Data

Mar 23, 2023

Hanjia Lyu, Arsal Imtiaz, Yufei Zhao, Jiebo Luo

Abstract: Since the World Health Organization (WHO) characterized COVID-19 as a pandemic in March 2020, there have been over 600 million confirmed cases of COVID-19 and more than six million deaths as of October 2022. The relationship between the COVID-19 pandemic and human behavior is complicated. On one hand, human behavior is found to shape the spread of the disease. On the other hand, the pandemic has impacted and even changed human behavior in almost every aspect. To provide a holistic understanding of the complex interplay between human behavior and the COVID-19 pandemic, researchers have been employing big data techniques such as natural language processing, computer vision, audio signal processing, frequent pattern mining, and machine learning. In this study, we present an overview of the existing studies on using big data techniques to study human behavior in the time of the COVID-19 pandemic. In particular, we categorize these studies into three groups - using big data to measure, model, and leverage human behavior, respectively. The related tasks, data, and methods are summarized accordingly. To provide more insights into how to fight the COVID-19 pandemic and future global catastrophes, we further discuss challenges and potential opportunities.

* Accepted for publication in the Horizons in Big Data 2022 article collection of Frontiers in Big Data

VideoXum: Cross-modal Visual and Textural Summarization of Videos

Mar 21, 2023

Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

Abstract: Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.

Spatial-Aware Token for Weakly Supervised Object Localization

Mar 18, 2023

Pingyu Wu, Wei Zhai, Yang Cao, Jiebo Luo, Zheng-Jun Zha

Abstract: Weakly supervised object localization (WSOL) is a challenging task aiming to localize objects with only image-level supervision. Recent works apply visual transformers to WSOL and achieve significant success by exploiting the long-range feature dependency in the self-attention mechanism. However, existing transformer-based methods synthesize the classification feature maps as the localization map, which leads to optimization conflicts between classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for the localization task. Then a spatial-aware attention module is constructed, which allows the spatial token to generate foreground probabilities of different patches by querying and to extract localization knowledge from the classification task. Besides, for the problem of sparse and unbalanced pixel-level supervision obtained from the image-level label, two spatial constraints, including batch area loss and normalization loss, are designed to compensate and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of using only 1 image per class from ImageNet for training, SAT already exceeds the SOTA method by 2.1% GT-known Loc. Code and models are available at https://github.com/wpy1999/SAT.

* Code: https://github.com/wpy1999/SAT

Grounding 3D Object Affordance from 2D Interactions in Images

Mar 18, 2023

Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo Luo, Zheng-Jun Zha

Abstract: Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space, which serves as a link between perception and operation for embodied agents. Existing studies primarily focus on connecting visual affordances with geometric structures, e.g. relying on annotations to declare interactive regions of interest on the object and establishing a mapping between the regions and affordances. However, the essence of learning object affordance is to understand how to use the object, and approaches that detach interactions are limited in generalization. Normally, humans possess the ability to perceive object affordances in the physical world through demonstration images or videos. Motivated by this, we introduce a novel task setting: grounding 3D object affordance from 2D interactions in images, which faces the challenge of anticipating affordance through interactions of different sources. To address this problem, we devise a novel Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources and models the interactive contexts for 3D object affordance grounding. Besides, we collect a Point-Image Affordance Dataset (PIAD) to support the proposed task. Comprehensive experiments on PIAD demonstrate the reliability of the proposed task and the superiority of our method. The project is available at https://github.com/yyvhang/IAGNet.
