Fabio Quattrini

Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Feb 09, 2026

Carmine Zaccagnino, Fabio Quattrini, Enis Simsar, Marta Tintoré Gazulla, Rita Cucchiara, Alessio Tonioni, Silvia Cascianelli

Abstract: Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
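
The idea of partitioning joint attention so that each instruction binds to its own region can be pictured as an attention mask over the joint text-and-image token sequence. The following is only an illustrative sketch, not the authors' implementation: the function name `build_instance_mask`, the per-token instance labels, the use of -1 for background, and the choice to keep image-to-image attention global are all assumptions made for this example.

```python
from typing import List


def build_instance_mask(text_instance: List[int],
                        image_region: List[int]) -> List[List[bool]]:
    """Boolean mask over the joint [text ; image] token sequence.

    text_instance[i]: instance id of the i-th instruction token.
    image_region[j]:  instance id of the region holding the j-th image
                      token, or -1 for background (assumed convention).
    mask[q][k] is True when query token q may attend to key token k.
    """
    n_text = len(text_instance)
    labels = list(text_instance) + list(image_region)
    n = len(labels)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_text = q < n_text
            k_is_text = k < n_text
            if not q_is_text and not k_is_text:
                # Assumption: image tokens still see the whole image,
                # which is one way to keep the output globally coherent.
                mask[q][k] = True
            else:
                # Any pair involving text is allowed only within one instance.
                mask[q][k] = labels[q] == labels[k]
    return mask


# Two instructions (ids 0 and 1) and four image tokens.
mask = build_instance_mask(text_instance=[0, 0, 1],
                           image_region=[0, 1, -1, 1])
```

In this toy layout, the tokens of instruction 0 can reach the image token in region 0 but not those in region 1, while background image tokens attend to no instruction at all.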

μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

Aug 28, 2024

Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara

Abstract: Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many fields of the social sciences and humanities. In this work, we focus on Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documentary sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually in a markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Nougat document parsing architecture, which can handle elements that span beyond a single page. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.

* Accepted at ECCV Workshop "AI4DH: Artificial Intelligence for Digital Humanities"
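
The multi-page context described in the abstract above (previous page, current page, following page) can be sketched as a simple windowing step over a scanned document. This is a minimal illustration and not the μgat code: the helper name `multi_page_contexts` is hypothetical, and padding the first and last page with `None` is an assumption, since the abstract does not say how document boundaries are handled.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

P = TypeVar("P")


def multi_page_contexts(
    pages: Sequence[P],
) -> List[Tuple[Optional[P], P, Optional[P]]]:
    """Pair every page with its previous and following page.

    Missing neighbours at the document boundaries are returned as None
    (an assumed convention for this sketch).
    """
    contexts = []
    for i, page in enumerate(pages):
        prev_page = pages[i - 1] if i > 0 else None
        next_page = pages[i + 1] if i + 1 < len(pages) else None
        contexts.append((prev_page, page, next_page))
    return contexts


# Page identifiers stand in for page images here.
contexts = multi_page_contexts(["p1", "p2", "p3"])
```

Each triple would then be fed to the parser, which produces markup for the middle page only while using the neighbours as context.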

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Aug 28, 2024

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Abstract: Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, an approach referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.

* Accepted at ECCV 2024
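
The joint diffusion setting mentioned in the abstract above, where independent diffusion paths run over overlapping latent windows and are then combined, can be sketched as follows. This illustrates only that baseline idea, not the Merge-Attend-Diffuse operator or the released MAD code. The 1D latent, the function name `joint_diffusion_step`, the plain averaging of overlaps, and the `denoise` callback are assumptions chosen to keep the example small.

```python
from typing import Callable, List


def joint_diffusion_step(
    latent: List[float],
    window: int,
    stride: int,
    denoise: Callable[[List[float]], List[float]],
) -> List[float]:
    """Denoise overlapping windows independently, then average the overlaps."""
    if window > len(latent):
        raise ValueError("window must not exceed the latent length")
    starts = list(range(0, len(latent) - window + 1, stride))
    if starts[-1] + window < len(latent):
        # Add a final window so the end of the latent is always covered.
        starts.append(len(latent) - window)

    total = [0.0] * len(latent)
    count = [0] * len(latent)
    for start in starts:
        out = denoise(latent[start:start + window])  # one independent path
        for offset, value in enumerate(out):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# A stand-in "denoiser" whose output depends only on the position in the window.
merged = joint_diffusion_step(
    latent=[0.0] * 6, window=4, stride=2,
    denoise=lambda w: [float(i) for i in range(len(w))],
)
```

Because every window is processed without seeing the others, the paths agree only where they overlap, which is consistent with the semantic incoherence the abstract attributes to this scheme.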

Alfie: Democratising RGBA Image Generation With No $$$

Aug 27, 2024

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Abstract: Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images, and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp cropping, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at https://github.com/aimagelab/Alfie.

* Accepted at ECCV AI for Visual Arts Workshop and Challenges

Binarizing Documents by Leveraging both Space and Frequency

Apr 26, 2024

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Abstract: Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can vary greatly throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.

* Accepted at ICDAR 2024

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

Oct 31, 2023

Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Rita Cucchiara

Abstract: Styled Handwritten Text Generation (Styled HTG) is an important task in document analysis, aiming to generate text images with the handwriting of given reference images. In recent years, there has been significant progress in the development of deep learning models for tackling this task. Being able to measure the performance of HTG models via a meaningful and representative criterion is key for fostering the development of this research topic. However, despite the current adoption of scores for natural image generation evaluation, assessing the quality of generated handwriting remains challenging. In light of this, we devise the Handwriting Distance (HWD), tailored for HTG evaluation. In particular, it works in the feature space of a network specifically trained to extract handwriting style features from the variable-length input images and exploits a perceptual distance to compare the subtle geometric features of handwriting. Through extensive experimental evaluation on different word-level and line-level datasets of handwritten text images, we demonstrate the suitability of the proposed HWD as a score for Styled HTG. The pretrained model used as backbone will be released to ease the adoption of the score, aiming to provide a valuable tool for evaluating HTG models and thus contributing to advancing this important research area.

* Accepted at BMVC 2023

Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

Aug 09, 2023

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Abstract: Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts. Among those, there has been an increasing interest in applying Artificial Intelligence techniques for virtually unwrapping and automatically detecting ink on the Herculaneum papyri collection. This collection consists of carbonized scrolls and fragments of documents, which have been digitized via X-ray tomography to allow the development of ad-hoc deep learning-based DDR solutions. In this work, we propose a modification of the Fast Fourier Convolution operator for volumetric data and apply it in a segmentation architecture for ink detection on the challenging Herculaneum papyri, demonstrating its suitability via in-depth experimental analysis. To encourage research on this task and the application of the proposed operator to other tasks involving volumetric data, we will release our implementation (https://github.com/aimagelab/vffc).

* Accepted at the 4th ICCV Workshop on e-Heritage (in conjunction with ICCV 2023)
