Fulvio Sanguigni

Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

Mar 23, 2026

Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe, Rita Cucchiara

Abstract: Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

May 27, 2025

Davide Lobba, Fulvio Sanguigni, Bin Ren, Marcella Cornia, Rita Cucchiara, Nicu Sebe

Abstract: While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format -- typically a flat, lay-down-style representation of the garment -- making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.

Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Apr 18, 2025

Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara

Abstract: In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.

* IJCNN 2025

Diffusion Models for Earth Observation Use-cases: from cloud removal to urban change detection

Nov 10, 2023

Fulvio Sanguigni, Mikolaj Czerkawski, Lorenzo Papa, Irene Amerini, Bertrand Le Saux

Abstract: The advancements in the state of the art of generative Artificial Intelligence (AI) brought by diffusion models can be highly beneficial in novel contexts involving Earth observation data. After introducing this new family of generative models, this work proposes and analyses three use cases which demonstrate the potential of diffusion-based approaches for satellite image data. Namely, we tackle cloud removal and inpainting, dataset generation for change-detection tasks, and urban replanning.

* Proceedings of the 2023 conference on Big Data from Space, Soille, P., Lumnitz, S. and Albani, S. editor(s), Publications Office of the European Union, Luxembourg, 2023
* Presented at Big Data from Space 2023 (BiDS)
