Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei-Hua Li

Customized Visual Storytelling with Unified Multimodal LLMs

Mar 29, 2026

Wei-Hua Li, Cheng Sun, Chu-Song Chen

Abstract:Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.

* Paper accepted to the CVPR 2026 Workshop on Generative AI for Storytelling (CVPRW)

Via

Access Paper or Ask Questions

Relation-Rich Visual Document Generator for Visual Information Extraction

Apr 14, 2025

Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen

Abstract:Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotations efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE .

* CVPR 2025

Via

Access Paper or Ask Questions

Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model

Dec 25, 2024

Yi-Chia Chen, Wei-Hua Li, Chu-Song Chen

Abstract:Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model called BEiT-3 and leveraging the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experiments result demonstrates that OMTSeg performs favorably against state-of-the-art models.

* ICIP 2024

Via

Access Paper or Ask Questions

ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Oct 10, 2024

Yu-Chen Lin, Wei-Hua Li, Jun-Cheng Chen, Chu-Song Chen

Figure 1 for ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Figure 2 for ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Figure 3 for ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Figure 4 for ACCEPT: Adaptive Codebook for Composite and Efficient Prompt Tuning

Abstract:Prompt Tuning has been a popular Parameter-Efficient Fine-Tuning method attributed to its remarkable performance with few updated parameters on various large-scale pretrained Language Models (PLMs). Traditionally, each prompt has been considered indivisible and updated independently, leading the parameters increase proportionally as prompt length grows. To address this issue, we propose Adaptive Codebook for Composite and Efficient Prompt Tuning (ACCEPT). In our method, we refer to the concept of product quantization (PQ), allowing all soft prompts to share a set of learnable codebook vectors in each subspace, with each prompt differentiated by a set of adaptive weights. We achieve the superior performance on 17 diverse natural language tasks including natural language understanding (NLU) and question answering (QA) tasks by tuning only 0.3% of parameters of the PLMs. Our approach also excels in few-shot and large model settings, highlighting its significant potential.

* EMNLP Finding 2024

Via

Access Paper or Ask Questions