Sizhong Qin

AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

Mar 30, 2026

Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang

Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.

* Accepted by CVPR 2026
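
The abstract above mentions residual vector quantization as the step that turns continuous video and audio features into discrete tokens. The following is a minimal sketch of that general technique only, not AutoCut's implementation: the two toy codebooks, the 2-D feature vector, and the function names `rvq_encode` and `rvq_decode` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Return the index of the codeword closest to vec (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def rvq_encode(vec: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword so the next stage refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

feature = [1.2, 1.0]  # stand-in for an encoder output
tokens, residual = rvq_encode(feature, CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
print(tokens, reconstruction)
```

In a real system the codebooks would be learned and the token indices would be mapped into the language model's vocabulary; how AutoCut does this is not specified in the abstract.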

Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Mar 12, 2026

Sizhong Qin, Ramon Elias Weber, Xinzheng Lu

Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

* 20 pages, 9 figures. Accepted to CVPR 2026

ChatHouseDiffusion: Prompt-Guided Generation and Editing of Floor Plans

Oct 15, 2024

Sizhong Qin, Chengyu He, Qiaoyun Chen, Sen Yang, Wenjie Liao, Yi Gu, Xinzheng Lu

Abstract: The generation and editing of floor plans are critical in architectural planning, requiring a high degree of flexibility and efficiency. Existing methods demand extensive input information and lack the capability for interactive adaptation to user modifications. This paper introduces ChatHouseDiffusion, which leverages large language models (LLMs) to interpret natural language input, employs Graphormer to encode topological relationships, and uses diffusion models to flexibly generate and edit floor plans. This approach allows iterative design adjustments based on user ideas, significantly enhancing design efficiency. Compared to existing models, ChatHouseDiffusion achieves higher Intersection over Union (IoU) scores, permitting precise, localized adjustments without the need for complete redesigns, thus offering greater practicality. Experiments demonstrate that our model not only strictly adheres to user specifications but also facilitates a more intuitive design process through its interactive capabilities.
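
The Intersection over Union (IoU) score named in the abstract can be illustrated with a small sketch. It assumes floor plans are rasterized into binary occupancy grids for a single room; the grid layout, the function name `mask_iou`, and the convention for two empty masks are illustrative choices, not the paper's evaluation protocol.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 3x4 grids: the predicted room is shifted one cell to the right.
target_room = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
predicted_room = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

iou = mask_iou(predicted_room, target_room)
print(f"IoU = {iou:.3f}")
```

Here the two rooms share 2 cells and cover 6 cells in total, so the score is 2/6. A multi-room evaluation would typically average such scores over rooms or classes.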
