Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Sep 19, 2025

Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan(+17 more)

Figure 1 for MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Figure 2 for MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Figure 3 for MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Figure 4 for MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Share this with someone who'll enjoy it:

Abstract:Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

View paper on

Share this with someone who'll enjoy it:

Title:MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper and Code