Jiebo Luo

Irony in Emojis: A Comparative Study of Human and LLM Interpretation

Jan 20, 2025

Yawen Zheng, Hanjia Lyu, Jiebo Luo

Abstract: Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o's interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.

Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Jan 15, 2025

Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei

Abstract: The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments on long video generation with the VBench benchmark demonstrate the superiority of Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
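
The FIFO queue described in this abstract can be illustrated with a toy simulation. This is only a sketch of the baseline queue mechanics, not Ouroboros-Diffusion itself: the function name `fifo_generate` is hypothetical, and an integer noise level stands in for the real latents and the pre-trained denoiser.

```python
from collections import deque


def fifo_generate(queue_len, num_frames):
    """Toy FIFO-style loop: frames carry an integer noise level, not real latents."""
    # The queue starts with noise levels 1..queue_len; the head is the cleanest frame.
    queue = deque({"id": i, "noise": i + 1} for i in range(queue_len))
    next_id = queue_len
    clean_ids = []
    while len(clean_ids) < num_frames:
        # Assumption: one denoising pass lowers every frame's noise level by one.
        for frame in queue:
            frame["noise"] -= 1
        # The head is now fully denoised, so it leaves the queue as an output frame.
        head = queue.popleft()
        clean_ids.append(head["id"])
        # A fresh, fully noisy frame is enqueued at the tail.
        queue.append({"id": next_id, "noise": queue_len})
        next_id += 1
    return clean_ids, [frame["noise"] for frame in queue]


clean_ids, noise_levels = fifo_generate(queue_len=4, num_frames=6)
print(clean_ids)     # frame ids in the order they were emitted
print(noise_levels)  # noise levels left in the queue, head to tail
```

Because every frame is emitted after exactly `queue_len` passes, the queue can keep producing frames indefinitely, which is what makes this style of generation length-agnostic. The sketch models no cross-frame correspondence at all, which is the gap the abstract says Ouroboros-Diffusion targets.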

SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation

Dec 20, 2024

Jiadong Pan, Hongcheng Gao, Liang Li, Zheng-Jun Zha, Qingming Huang, Jiebo Luo

Abstract: Diffusion models (DMs) have demonstrated exceptional performance in text-to-image (T2I) tasks, leading to their widespread use. With the introduction of classifier-free guidance (CFG), the quality of images generated by DMs is improved. However, DMs can be made to generate more harmful images when the image generation process is maliciously guided through CFG. Some safe guidance methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we introduce the Harmful Guidance Redirector (HGR), which redirects harmful CFG direction while preserving clean CFG direction during image generation, transforming CFG into SafeCFG and achieving both safe and high-quality generation. We train HGR to redirect multiple harmful CFG directions simultaneously, demonstrating its ability to eliminate various harmful elements while preserving high-quality generation. Additionally, we find that HGR can detect image harmfulness, allowing for unsupervised fine-tuning of safe diffusion models without pre-defined clean or harmful labels. Experimental results show that by incorporating HGR, images generated by diffusion models achieve both high quality and strong safety, and safe DMs trained through unsupervised methods according to the harmfulness detected by HGR also exhibit good safety performance. The code will be made publicly available.
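
For readers unfamiliar with the "CFG direction" mentioned in this abstract, the standard classifier-free guidance combination can be sketched as below. The sketch shows only plain CFG on toy lists of floats; it does not implement HGR or SafeCFG, and the function name `cfg_combine` is a hypothetical one chosen for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance on toy noise predictions (lists of floats)."""
    # The "CFG direction" is the difference between the conditional and
    # unconditional predictions; the guidance scale controls how far to move along it.
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


# With scale 1.0 the result equals the conditional prediction;
# larger scales push further along the conditional direction.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))
```

Since the text prompt determines the direction `c - u`, a harmful prompt yields a harmful direction, which is the quantity the abstract says HGR redirects.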

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Dec 11, 2024

Yayun Qi, Hongxi Li, Yiqi Song, Xinxiao Wu, Jiebo Luo

Abstract: The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continuously attracts the research community's attention. Despite the improvements in overall performance, classic challenges still exist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving the research on vision-language tasks. Thanks to the massive scale of training data and model parameters, pre-trained models have exhibited excellent performance in numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research with increasing attention and rapid advances. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize the recent advances in incorporating pre-trained models to address the challenges in vision-language tasks. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models and discuss possible solutions, attempting to provide future research directions.

* Under Review

Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training

Dec 08, 2024

Zhenghong Zhou, Jie An, Jiebo Luo

Abstract: Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which is both data-intensive and computationally costly and can disrupt the pre-trained model distribution. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model latent space, ensuring high-quality video generation. Experimental results demonstrate that Latent-Reframe achieves comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.

* Project Page: https://latent-reframe.github.io

Personalized Multimodal Large Language Models: A Survey

Dec 03, 2024

Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt (+17 more)

Abstract: Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Nov 26, 2024

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan

Abstract: Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.

* 12 pages, 8 figures
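
The idea of splitting a signal into a smooth low-frequency part and a detailed high-frequency part, which the abstract above applies to facial features, can be illustrated on a one-dimensional toy signal. This is a generic sketch that assumes a moving-average low-pass filter; it is not ConsisID's global or local facial extractor, and `split_frequencies` is a hypothetical name.

```python
def split_frequencies(signal, window=3):
    """Split a 1-D signal into a moving-average (low) part and a residual (high) part."""
    half = window // 2
    low = []
    for i in range(len(signal)):
        # Average over a window centered on i, truncated at the signal borders.
        start = max(0, i - half)
        end = min(len(signal), i + half + 1)
        neighborhood = signal[start:end]
        low.append(sum(neighborhood) / len(neighborhood))
    # The high-frequency part is whatever the smoothing removed.
    high = [s - l for s, l in zip(signal, low)]
    return low, high


low, high = split_frequencies([0.0, 4.0, 0.0, 4.0, 0.0])
print(low)   # smooth trend (coarse, "global" structure)
print(high)  # rapid variation (fine detail)
```

By construction the two parts sum back to the original signal, so nothing is lost by treating them separately.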

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Nov 23, 2024

Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo

Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle to perceive fine-grained regional compositional information in images. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.

* Preprint

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Nov 22, 2024

Taowen Wang, Dongfang Liu, James Chenhao Liang, Wenhao Yang, Qifan Wang, Cheng Han, Jiebo Luo, Ruixiang Tang

Abstract: Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely unexplored, this paper systematically quantifies the robustness of VLA-based robotic systems. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce an untargeted position-aware attack objective that leverages spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, this work advances both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for developing robust defense strategies prior to physical-world deployments.

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Nov 20, 2024

Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji

Abstract: Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.

* 10 pages, 6 figures
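
The plug-and-play step described above, in which extracted auxiliary texts are selected and passed to the model together with the query, might look roughly like the following. Everything here is an assumption made for illustration: the word-overlap scoring, the source tags (`asr`, `ocr`, `det`), and the function names are hypothetical and are not taken from the paper; the video frames and the LVLM call are omitted.

```python
def select_auxiliary_texts(query, candidates, top_k=2):
    """Toy single-turn retrieval: rank (source, text) pairs by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for source, text in candidates:
        overlap = len(query_words & set(text.lower().split()))
        scored.append((overlap, source, text))
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    # Drop texts that share no words with the query.
    return [(source, text) for overlap, source, text in scored[:top_k] if overlap > 0]


def build_prompt(query, selected):
    """Prepend the selected auxiliary texts to the question as plain text."""
    lines = [f"[{source}] {text}" for source, text in selected]
    lines.append(f"Question: {query}")
    return "\n".join(lines)


# Hypothetical outputs of speech recognition, OCR, and object detection tools.
candidates = [
    ("asr", "the chef adds salt to the soup"),
    ("ocr", "exit sign"),
    ("det", "a pot of soup on the stove"),
]
query = "what does the chef add to the soup"
print(build_prompt(query, select_auxiliary_texts(query, candidates)))
```

Retrieval happens once per query, which matches the single-turn design the abstract credits for the low computing overhead.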
