Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works published between 2021 and 2024. First, we introduce four foundation model architectures for T2I (autoregressive, non-autoregressive, GAN, and diffusion) and the commonly used key technologies (autoencoders, attention, and classifier-free guidance). Second, we systematically compare the methods of these studies along two directions, T2I generation and T2I editing, including the encoders and key technologies they use. We also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. Beyond the four foundation models, we survey other lines of T2I work, such as energy-based models and the more recent Mamba and multimodal approaches. We further investigate the potential social impact of T2I and discuss possible mitigations. Finally, we offer insights into improving the performance of T2I models and outline possible future research directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and to stimulate continued progress in the field.
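As a minimal illustration of classifier-free guidance, one of the key technologies named in this survey abstract (shown here in its standard, widely used formulation rather than as taken from any specific surveyed work): at each sampling step, the conditional and unconditional noise predictions are combined as

\[
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big),
\]

where \(c\) is the text condition, \(\varnothing\) the null (empty) condition, and \(w\) the guidance scale that trades sample diversity for prompt fidelity.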
Abstract: Recently, prompt learning has emerged as the state-of-the-art (SOTA) approach for fair text-to-image (T2I) generation. Specifically, this approach leverages readily available reference images to learn inclusive prompts for each target Sensitive Attribute (tSA), enabling fair image generation. In this work, we first reveal that this prompt learning-based approach results in degraded sample quality. Our analysis shows that the approach's training objective, which aims to align the embedding differences of the learned prompts and the reference images, can be sub-optimal, distorting the learned prompts and degrading the generated images. To further substantiate this claim, as our major contribution, we take a deep dive into the denoising subnetwork of the T2I model to track the effect of these learned prompts by analyzing the cross-attention maps. In our analysis, we propose a novel prompt switching analysis, I2H and H2I, as well as a new quantitative characterization of cross-attention maps. Our analysis reveals abnormalities in the early denoising steps that perpetuate an improper global structure and degrade the generated samples. Building on these insights, we propose two ideas, (i) Prompt Queuing and (ii) Attention Amplification, to address the quality issue. Extensive experiments on a wide range of tSAs show that our method outperforms the SOTA approach in image generation quality while achieving competitive fairness. More resources are available at the FairQueue project site: https://sutd-visual-computing-group.github.io/FairQueue
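To give a concrete sense of what quantitatively characterizing cross-attention maps can look like, here is a small, purely illustrative sketch: it computes the spatial entropy of each text token's attention map at a given denoising step. This is a generic statistic chosen for illustration, not the specific characterization proposed in the paper.

```python
import numpy as np

def attention_entropy(attn_maps: np.ndarray) -> np.ndarray:
    # attn_maps: shape (num_tokens, H, W), one denoising step's cross-attention
    # weights for each text token. Lower entropy = more spatially focused token.
    flat = attn_maps.reshape(attn_maps.shape[0], -1)
    probs = flat / (flat.sum(axis=1, keepdims=True) + 1e-12)  # normalize each token's map
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)       # spatial entropy per token

# Example: random maps for 5 tokens on a 16x16 latent grid.
maps = np.random.rand(5, 16, 16)
print(attention_entropy(maps))
```

Tracking a statistic like this across denoising steps (for example, comparing learned prompts against plain text prompts in the early steps) is one way abnormalities in global structure could be made visible.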
Abstract: Cyberbullying harms teenagers' mental health, and teaching them upstanding intervention is crucial. Wizard-of-Oz studies show that chatbots can scale up personalized and interactive cyberbullying education, but implementing such chatbots is a challenging and delicate task. We created a no-code chatbot design tool for K-12 teachers. Using large language models and prompt chaining, our tool allows teachers to prototype bespoke dialogue flows and chatbot utterances. In offering this tool, we explore teachers' distinctive needs when designing chatbots to assist their teaching, and how chatbot design tools might better support them. Our findings reveal that teachers welcome the tool enthusiastically. Moreover, they see themselves as playwrights guiding both the students' and the chatbot's behaviors, while allowing for some improvisation. Their goal is to enable students to rehearse both desirable and undesirable reactions to cyberbullying in a safe environment. We discuss the design opportunities LLM-Chains offer for empowering teachers and the research opportunities this work opens up.
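To make the prompt-chaining idea mentioned above concrete, here is a minimal, purely illustrative sketch of a two-step chain for a teacher-authored dialogue flow. The `call_llm` function and the prompts are hypothetical placeholders, not the tool's actual implementation or any specific library's API.

```python
from typing import Callable

def run_chain(call_llm: Callable[[str], str], student_message: str) -> str:
    # Step 1: classify the student's reaction to a cyberbullying scenario.
    label = call_llm(
        "Classify this student reply to a cyberbullying scenario as "
        f"'upstanding' or 'bystanding': {student_message}"
    )
    # Step 2: feed the classification into a teacher-written coaching prompt.
    return call_llm(
        f"The student's reply was judged '{label.strip()}'. "
        "As a supportive peer chatbot, respond in two sentences that encourage "
        "safe, upstanding behavior."
    )
```

Chaining the output of one prompt into the next is what lets a teacher script a dialogue flow step by step while still leaving room for the model's improvisation within each step.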