Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiebo Luo

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Oct 11, 2023

Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo

Figure 1 for OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Figure 2 for OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Figure 3 for OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Figure 4 for OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

Abstract:This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.

Via

Access Paper or Ask Questions

Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

Sep 18, 2023

Jinsheng Pan, Zichen Wang, Weihong Qi, Hanjia Lyu, Jiebo Luo

Figure 1 for Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

Figure 2 for Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

Figure 3 for Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

Figure 4 for Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

Abstract:Understanding the framing of political issues is of paramount importance as it significantly shapes how individuals perceive, interpret, and engage with these matters. While prior research has independently explored framing within news media and by social media users, there remains a notable gap in our comprehension of the disparities in framing political issues between these two distinct groups. To address this gap, we conduct a comprehensive investigation, focusing on the nuanced distinctions both qualitatively and quantitatively in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. Our findings reveal that, while some overlap in framing exists between social media and traditional media outlets, substantial differences emerge both across various topics and within specific framing categories. Compared to traditional news media, social media platforms tend to present more polarized stances across all framing categories. Further, we observe significant polarization in the news media's treatment (i.e., Left vs. Right leaning media) of affirmative action and abortion rights, whereas the topic of student loans tends to exhibit a greater degree of consensus. The disparities in framing between traditional and social media platforms carry significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.

Via

Access Paper or Ask Questions

SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Aug 17, 2023

Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang

Figure 1 for SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Figure 2 for SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Figure 3 for SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Figure 4 for SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Abstract:The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to poor generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM's pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code will be released at https://github.com/wenxi-yue/SurgicalSAM.

* Technical Report. The source code will be released at https://github.com/wenxi-yue/SurgicalSAM

Via

Access Paper or Ask Questions

LLM-Rec: Personalized Recommendation via Prompting Large Language Models

Aug 16, 2023

Hanjia Lyu, Song Jiang, Hanqing Zeng, Qifan Wang, Si Zhang, Ren Chen, Chris Leung, Jiajie Tang, Yinglong Xia, Jiebo Luo

Figure 1 for LLM-Rec: Personalized Recommendation via Prompting Large Language Models

Figure 2 for LLM-Rec: Personalized Recommendation via Prompting Large Language Models

Figure 3 for LLM-Rec: Personalized Recommendation via Prompting Large Language Models

Figure 4 for LLM-Rec: Personalized Recommendation via Prompting Large Language Models

Abstract:We investigate various prompting strategies for enhancing personalized recommendation performance with large language models (LLMs) through input augmentation. Our proposed approach, termed LLM-Rec, encompasses four distinct prompting strategies: (1) basic prompting, (2) recommendation-driven prompting, (3) engagement-guided prompting, and (4) recommendation-driven + engagement-guided prompting. Our empirical experiments show that incorporating the augmented input text generated by LLM leads to improved recommendation performance. Recommendation-driven and engagement-guided prompting strategies are found to elicit LLM's understanding of global and local item characteristics. This finding highlights the importance of leveraging diverse prompts and input augmentation techniques to enhance the recommendation capabilities with LLMs.

Via

Access Paper or Ask Questions

Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Aug 14, 2023

Alexander Martin, Haitian Zheng, Jie An, Jiebo Luo

Figure 1 for Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Figure 2 for Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Figure 3 for Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Figure 4 for Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation

Abstract:With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry need to be generated to enter the target domain. Being able to perform translations across large domain gaps has a wide variety of real-world applications in criminology, astrology, environmental conservation, and paleontology. In this work, we introduce a new task Skull2Animal for translating between skulls and living animals. On this task, we find that unguided Generative Adversarial Networks (GANs) are not capable of translating across large domain gaps. Instead of these traditional I2I methods, we explore the use of guided diffusion and image editing models and provide a new benchmark model, Revive-2I, capable of performing zero-shot I2I via text-prompting latent diffusion models. We find that guidance is necessary for longI2I because, to bridge the large domain gap, prior knowledge about the target domain is needed. In addition, we find that prompting provides the best and most scalable information about the target domain as classifier-guided diffusion models require retraining for specific use cases and lack stronger constraints on the target domain because of the wide variety of images they are trained on.

* 9 pages, 10 figures, ACM Multimedia 2023

Via

Access Paper or Ask Questions

User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations

Aug 02, 2023

Juntao Tan, Yingqiang Ge, Yan Zhu, Yinglong Xia, Jiebo Luo, Jianchao Ji, Yongfeng Zhang

Figure 1 for User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations

Figure 2 for User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations

Figure 3 for User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations

Figure 4 for User-Controllable Recommendation via Counterfactual Retrospective and Prospective Explanations

Abstract:Modern recommender systems utilize users' historical behaviors to generate personalized recommendations. However, these systems often lack user controllability, leading to diminished user satisfaction and trust in the systems. Acknowledging the recent advancements in explainable recommender systems that enhance users' understanding of recommendation mechanisms, we propose leveraging these advancements to improve user controllability. In this paper, we present a user-controllable recommender system that seamlessly integrates explainability and controllability within a unified framework. By providing both retrospective and prospective explanations through counterfactual reasoning, users can customize their control over the system by interacting with these explanations. Furthermore, we introduce and assess two attributes of controllability in recommendation systems: the complexity of controllability and the accuracy of controllability. Experimental evaluations on MovieLens and Yelp datasets substantiate the effectiveness of our proposed framework. Additionally, our experiments demonstrate that offering users control options can potentially enhance recommendation accuracy in the future. Source code and data are available at \url{https://github.com/chrisjtan/ucr}.

* Accepted for presentation at 26th European Conference on Artificial Intelligence (ECAI2023)

Via

Access Paper or Ask Questions

MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

Jul 31, 2023

Junchen Zhu, Huan Yang, Wenjing Wang, Huiguo He, Zixi Tuo, Yongsheng Yu, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu(+1 more)

Figure 1 for MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

Figure 2 for MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

Abstract:Videos for mobile devices become the most popular access to share and acquire information recently. For the convenience of users' creation, in this paper, we present a system, namely MobileVidFactory, to automatically generate vertical mobile videos where users only need to give simple texts mainly. Our system consists of two parts: basic and customized generation. In the basic generation, we take advantage of the pretrained image diffusion model, and adapt it to a high-quality open-domain vertical video generator for mobile devices. As for the audio, by retrieving from our big database, our system matches a suitable background sound for the video. Additionally to produce customized content, our system allows users to add specified screen texts to the video for enriching visual expression, and specify texts for automatic reading with optional voices as they like.

* Accepted by ACM-MM 2023 demo

Via

Access Paper or Ask Questions

Domain-Scalable Unpaired Image Translation via Latent Space Anchoring

Jun 26, 2023

Siyu Huang, Jie An, Donglai Wei, Zudi Lin, Jiebo Luo, Hanspeter Pfister

Figure 1 for Domain-Scalable Unpaired Image Translation via Latent Space Anchoring

Figure 2 for Domain-Scalable Unpaired Image Translation via Latent Space Anchoring

Figure 3 for Domain-Scalable Unpaired Image Translation via Latent Space Anchoring

Figure 4 for Domain-Scalable Unpaired Image Translation via Latent Space Anchoring

Abstract:Unpaired image-to-image translation (UNIT) aims to map images between two visual domains without paired training data. However, given a UNIT model trained on certain domains, it is difficult for current methods to incorporate new domains because they often need to train the full model on both existing and new domains. To address this problem, we propose a new domain-scalable UNIT method, termed as latent space anchoring, which can be efficiently extended to new visual domains and does not need to fine-tune encoders and decoders of existing domains. Our method anchors images of different domains to the same latent space of frozen GANs by learning lightweight encoder and regressor models to reconstruct single-domain images. In the inference phase, the learned encoders and decoders of different domains can be arbitrarily combined to translate images between any two domains without fine-tuning. Experiments on various datasets show that the proposed method achieves superior performance on both standard and domain-scalable UNIT tasks in comparison with the state-of-the-art methods.

* Accepeted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Code is available at https://github.com/siyuhuang/Latent-Space-Anchoring

Via

Access Paper or Ask Questions

ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

Jun 05, 2023

Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, Mingyu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng(+17 more)

Abstract:Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extraction from Visually-Rich Document images (SVRD). We set up two tracks for SVRD including Track 1: HUST-CELL and Track 2: Baidu-FEST, where HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling, and Baidu-FEST focuses on evaluating the performance and generalization of Zero-shot / Few-shot Structured Text extraction from an end-to-end perspective. Compared to the current document benchmarks, our two tracks of competition benchmark enriches the scenarios greatly and contains more than 50 types of visually-rich document images (mainly from the actual enterprise applications). The competition opened on 30th December, 2022 and closed on 24th March, 2023. There are 35 participants and 91 valid submissions received for Track 1, and 15 participants and 26 valid submissions received for Track 2. In this report we will presents the motivation, competition datasets, task definition, evaluation protocol, and submission summaries. According to the performance of the submissions, we believe there is still a large gap on the expected information extraction performance for complex and zero-shot scenarios. It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.

* ICDAR 2023 Competition on SVRD report (To be appear in ICDAR 2023)

Via

Access Paper or Ask Questions

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

May 31, 2023

Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo

Abstract:To increase the generalization capability of VQA systems, many recent studies have tried to de-bias spurious language or vision associations that shortcut the question or image to the answer. Despite these efforts, the literature fails to address the confounding effect of vision and language simultaneously. As a result, when they reduce bias learned from one modality, they usually increase bias from the other. In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect. The model trained in this strategy can concurrently and efficiently reduce vision and language bias. To the best of our knowledge, this is the first work to reduce biases resulting from confounding effects of vision and language in VQA, leveraging causal explain-away relations. We accompany our method with an explain-away strategy, pushing the accuracy of the questions with numerical answers results compared to existing methods that have been an open problem. The proposed method outperforms the state-of-the-art methods in VQA-CP v2 datasets.

* 22 pages

Via

Access Paper or Ask Questions