Rita Cucchiara

OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Sep 11, 2023
Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging, classification, and multimodal retrieval, prior works either defined supervised learning approaches with limited generalization capabilities or adopted more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
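
The training objective behind this kind of vision-and-language contrastive learning is the standard CLIP-style symmetric loss over image-text pairs. The snippet below is a minimal sketch of that objective, not the authors' implementation: the temperature value, feature dimension, and the random tensors standing in for encoder outputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (batch, dim) embeddings from the two encoders."""
    # L2-normalize so that the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix scaled by the temperature.
    logits = image_feats @ text_feats.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```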

* International Conference on Image Analysis and Processing (ICIAP) 2023 

With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Aug 23, 2023
Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics of an image and translating them into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information that can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training with cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net.
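
The core idea of attending over prototype summaries of past activations can be sketched as a standard attention layer whose keys and values are augmented with a prototype bank. The layer below is an illustrative assumption of how this could look, not PMA-Net's actual memory construction: in the paper the prototypes summarize the distribution of past keys and values, while here they are plain learnable parameters.

```python
import torch

class PrototypeAugmentedAttention(torch.nn.Module):
    """Single-head attention whose keys/values are augmented with a bank of
    prototype vectors standing in for activations from other training samples."""

    def __init__(self, dim: int, num_prototypes: int = 64):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        # In the paper the prototypes summarize past keys/values; here they are
        # plain learnable parameters for illustration only.
        self.proto_k = torch.nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.proto_v = torch.nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        b = x.size(0)
        q = self.q_proj(x)
        # Concatenate prototype keys/values to the ones of the current sample.
        k = torch.cat([self.k_proj(x), self.proto_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.proto_v.expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v

# out = PrototypeAugmentedAttention(dim=512)(torch.randn(2, 10, 512))
```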

* ICCV 2023 

TrackFlow: Multi-Object Tracking with Normalizing Flows

Aug 22, 2023
Gianluca Mancusi, Aniello Panariello, Angelo Porrello, Matteo Fabbri, Simone Calderara, Rita Cucchiara

The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim to extend tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information, e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they assume these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
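
The probabilistic formulation boils down to filling the association cost matrix with negative log-likelihoods produced by a learned density model. The sketch below illustrates that step under my own assumptions (feature layout, and a toy standard-normal density standing in for the trained normalizing flow); it is not the TrackFlow implementation.

```python
import torch

def association_costs(detections: torch.Tensor,
                      tracklets: torch.Tensor,
                      log_prob_fn) -> torch.Tensor:
    """Build a (num_tracklets, num_detections) cost matrix where each entry is
    the negative log-likelihood of the (tracklet, detection) pair under a
    learned density model of correct associations.

    detections: (D, f) per-detection features (e.g. motion, appearance, coarse 3D).
    tracklets:  (T, f) per-tracklet features.
    log_prob_fn: callable mapping (N, 2f) joint features to (N,) log-likelihoods,
                 e.g. the log_prob of a trained density estimator.
    """
    T, D = tracklets.size(0), detections.size(0)
    # All tracklet-detection pairs, flattened to a batch of joint features.
    pairs = torch.cat([
        tracklets.unsqueeze(1).expand(T, D, -1),
        detections.unsqueeze(0).expand(T, D, -1),
    ], dim=-1).reshape(T * D, -1)
    costs = -log_prob_fn(pairs).reshape(T, D)  # higher NLL = worse match
    return costs

# Toy usage with a standard-normal "density model" standing in for the flow;
# the resulting matrix would then feed the usual Hungarian matching step.
toy_log_prob = lambda x: torch.distributions.Normal(0.0, 1.0).log_prob(x).sum(-1)
cost_matrix = association_costs(torch.randn(5, 8), torch.randn(3, 8), toy_log_prob)
```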

* Accepted at ICCV 2023 

Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

Aug 09, 2023
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts. Among those, there has been an increasing interest in applying Artificial Intelligence techniques for virtually unwrapping and automatically detecting ink on the Herculaneum papyri collection. This collection consists of carbonized scrolls and fragments of documents, which have been digitized via X-ray tomography to allow the development of ad-hoc deep learning-based DDR solutions. In this work, we propose a modification of the Fast Fourier Convolution operator for volumetric data and apply it in a segmentation architecture for ink detection on the challenging Herculaneum papyri, demonstrating its suitability via an in-depth experimental analysis. To encourage research on this task and the application of the proposed operator to other tasks involving volumetric data, we will release our implementation (https://github.com/aimagelab/vffc).
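
As a rough illustration of the operator, the global (spectral) branch of a Fast Fourier Convolution can be extended to volumes by transforming over the three spatial axes. The block below is a minimal sketch under that reading, with assumed channel handling (stacking real and imaginary parts) and layer choices; it is not the released implementation.

```python
import torch
import torch.nn as nn

class VolumetricSpectralBlock(nn.Module):
    """Sketch of a 3D spectral branch: a real FFT over the three spatial axes,
    a pointwise convolution on the stacked real/imaginary parts, and an
    inverse FFT back to voxel space."""

    def __init__(self, channels: int):
        super().__init__()
        # Operates on concatenated real and imaginary parts (2x channels).
        self.freq_conv = nn.Sequential(
            nn.Conv3d(channels * 2, channels * 2, kernel_size=1),
            nn.BatchNorm3d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, D, H, W)
        spatial = x.shape[-3:]
        freq = torch.fft.rfftn(x, dim=(-3, -2, -1), norm="ortho")
        stacked = torch.cat([freq.real, freq.imag], dim=1)  # (B, 2C, D, H, W//2+1)
        stacked = self.freq_conv(stacked)
        real, imag = stacked.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfftn(freq, s=spatial, dim=(-3, -2, -1), norm="ortho")

# out = VolumetricSpectralBlock(16)(torch.randn(2, 16, 32, 64, 64))
```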

* Accepted at the 4th ICCV Workshop on e-Heritage (in conjunction with ICCV 2023) 

CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle Components

Jul 24, 2023
Davide Di Nucci, Alessandro Simoni, Matteo Tomei, Luca Ciuffreda, Roberto Vezzani, Rita Cucchiara

Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly effective technique for representing 3D reconstructions of objects and scenes derived from sets of images. Despite their efficiency, NeRF models can pose challenges in certain scenarios such as vehicle inspection, where the lack of sufficient data or the presence of challenging elements (e.g. reflections) strongly impacts the accuracy of the reconstruction. To this end, we introduce CarPatch, a novel synthetic benchmark of vehicles. In addition to a set of images annotated with their intrinsic and extrinsic camera parameters, the corresponding depth maps and semantic segmentation masks have been generated for each view. Global and part-based metrics have been defined and used to evaluate, compare, and better characterize some state-of-the-art techniques. The dataset is publicly released at https://aimagelab.ing.unimore.it/go/carpatch and can be used as an evaluation guide and as a baseline for future work on this challenging topic.
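
A part-based metric of the kind mentioned above can be obtained by restricting a standard image-quality measure to the pixels of a semantic mask. The helper below is a hypothetical example (masked PSNR over one vehicle component), not a metric taken from the benchmark's code.

```python
import torch

def masked_psnr(pred: torch.Tensor, target: torch.Tensor,
                mask: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """PSNR restricted to the pixels of one vehicle component.

    pred, target: (H, W, 3) rendered and ground-truth images in [0, max_val].
    mask:         (H, W) boolean semantic mask of the component (e.g. wheels).
    """
    diff = (pred - target)[mask]          # keep only the masked pixels
    mse = diff.pow(2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Hypothetical usage, with seg_mask/WHEEL_ID standing in for benchmark labels:
# psnr_wheels = masked_psnr(rendered, gt, seg_mask == WHEEL_ID)
```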

* Accepted at ICIAP 2023 

Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

Jul 19, 2023
Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content and its adherence to the prompted requests: so far, only human-based evaluations have been adopted for quality satisfaction and for comparing different generative methods. We introduce a novel automated method for Visual Concept Evaluation (ViCE), i.e. to assess consistency between a generated/edited image and the corresponding prompt/instructions, with a process inspired by human cognitive behaviour. ViCE combines the strengths of Large Language Models (LLMs) and Visual Question Answering (VQA) into a unified pipeline, aiming to replicate the human cognitive process in quality assessment. This method outlines visual concepts, formulates image-specific verification questions, utilizes the Q&A system to investigate the image, and scores the combined outcome. Although this brave new hypothesis of mimicking humans in the image evaluation process is in its preliminary assessment stage, results are promising and open the door to a new form of automatic evaluation which could have a significant impact as image generation and targeted image editing tasks become increasingly sophisticated.
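
The pipeline can be summarized as: decompose the prompt into visual concepts, turn each concept into verification questions, answer them with a VQA model, and aggregate the outcomes. The sketch below only fixes that control flow; the three callables are hypothetical placeholders for the LLM and VQA components, not real APIs from the paper.

```python
from typing import Callable, List

def vice_style_score(image, prompt: str,
                     extract_concepts: Callable[[str], List[str]],
                     generate_questions: Callable[[str], List[str]],
                     answer_with_vqa: Callable[..., bool]) -> float:
    """Return the fraction of prompt-derived checks the image passes."""
    questions: List[str] = []
    # 1) An LLM decomposes the prompt into visual concepts...
    for concept in extract_concepts(prompt):
        # 2) ...and formulates yes/no verification questions for each concept.
        questions.extend(generate_questions(concept))
    if not questions:
        return 0.0
    # 3) A VQA model inspects the image and answers each question.
    answers = [answer_with_vqa(image, q) for q in questions]
    # 4) The individual outcomes are aggregated into a single score.
    return sum(answers) / len(answers)
```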

* Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track) 

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

Jun 12, 2023
Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed $k$-CLIP, which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. Source code and trained models are publicly available at: https://github.com/aimagelab/MaPeT.
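
Permuted prediction of visual tokens can be realized with an attention mask derived from a random factorization order, so that each token is predicted from the tokens preceding it in that order. The helper below sketches such a mask; it illustrates the general permuted-autoregressive idea and is not MaPeT's exact masking scheme.

```python
import torch

def permuted_attention_mask(num_tokens: int, generator=None) -> torch.Tensor:
    """Boolean (num_tokens, num_tokens) mask where entry (i, j) is True if
    token i may attend to token j, i.e. j precedes i in a random factorization
    order over the visual tokens."""
    order = torch.randperm(num_tokens, generator=generator)
    # rank[t] = position of token t in the sampled factorization order.
    rank = torch.empty(num_tokens, dtype=torch.long)
    rank[order] = torch.arange(num_tokens)
    # Token i sees token j only if j comes strictly earlier in the order.
    return rank.unsqueeze(1) > rank.unsqueeze(0)

mask = permuted_attention_mask(16)   # e.g. a 4x4 grid of patch tokens
# The mask can be passed to a Transformer so that each visual token is
# predicted only from the tokens preceding it in the permutation.
```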

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

May 22, 2023
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists of generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on the Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models will be publicly released at: https://github.com/miccunifi/ladi-vton.
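
The textual-inversion component amounts to a mapping from garment visual features to a few pseudo-word embeddings living in the CLIP token embedding space. The module below is an illustrative sketch of that idea with assumed dimensions and a simple MLP mapper; it is not the released LaDI-VTON code.

```python
import torch
import torch.nn as nn

class GarmentTextualInversion(nn.Module):
    """Project garment image features to pseudo-word embeddings that can be
    spliced into the embedded text prompt conditioning the diffusion model.
    Dimensions and mapper depth are illustrative assumptions."""

    def __init__(self, visual_dim: int = 768, token_dim: int = 768,
                 num_pseudo_tokens: int = 4):
        super().__init__()
        self.num_pseudo_tokens = num_pseudo_tokens
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, token_dim * 2),
            nn.GELU(),
            nn.Linear(token_dim * 2, token_dim * num_pseudo_tokens),
        )

    def forward(self, garment_feats: torch.Tensor) -> torch.Tensor:
        # garment_feats: (B, visual_dim) image features of the in-shop garment.
        tokens = self.mapper(garment_feats)
        # -> (B, num_pseudo_tokens, token_dim) pseudo-word token embeddings.
        return tokens.view(-1, self.num_pseudo_tokens,
                           tokens.size(-1) // self.num_pseudo_tokens)

# pseudo_words = GarmentTextualInversion()(torch.randn(2, 768))  # (2, 4, 768)
```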

How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning

May 04, 2023
Vittorio Pippi, Silvia Cascianelli, Christopher Kermorvant, Rita Cucchiara

Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is very relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, unfeasible. To overcome this challenge, a possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we take into account both large real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, also considering the number of fine-tuning lines, we give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as few as five real fine-tuning lines.
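
The pretrain-then-fine-tune recipe itself is straightforward; the sketch below shows a minimal single-writer adaptation loop under common HTR assumptions (a CTC-trained line recognizer and a hypothetical `encode_fn` helper wrapping preprocessing and the forward pass). It is illustrative only and not tied to the specific models studied in the paper.

```python
import torch

def finetune_on_writer(model: torch.nn.Module, line_images, transcriptions,
                       encode_fn, epochs: int = 50, lr: float = 1e-4):
    """Adapt a pretrained HTR model to a single writer using only a handful of
    annotated text lines.

    encode_fn(model, images, texts) is a hypothetical helper returning
    (log_probs, targets, input_lengths, target_lengths) as expected by CTC.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    ctc = torch.nn.CTCLoss(zero_infinity=True)   # a common choice for HTR outputs
    model.train()
    for _ in range(epochs):
        log_probs, targets, input_lens, target_lens = encode_fn(
            model, line_images, transcriptions)
        loss = ctc(log_probs, targets, input_lens, target_lens)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```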

* Accepted at ICDAR 2023 

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Apr 04, 2023
Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can be used to improve the fashion design process. Unlike previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations will be publicly released at: https://github.com/aimagelab/multimodal-garment-designer.
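
One common way to inject spatial conditioning such as poses and sketches into a latent diffusion denoiser is to resize them to the latent resolution and concatenate them to the noisy latent along the channel axis, while text enters through cross-attention. The module below sketches only that input-side step, with assumed channel counts; it is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalDenoiserInput(nn.Module):
    """Fuse the noisy latent with resized pose and sketch maps before the
    denoising UNet. Channel counts (4 latent, 18 pose keypoint heatmaps,
    1 sketch) are illustrative assumptions."""

    def __init__(self, latent_ch: int = 4, pose_ch: int = 18, sketch_ch: int = 1):
        super().__init__()
        # Project the concatenated input back to the channels the UNet expects.
        self.proj = nn.Conv2d(latent_ch + pose_ch + sketch_ch, latent_ch,
                              kernel_size=1)

    def forward(self, noisy_latent, pose_map, sketch):
        size = noisy_latent.shape[-2:]
        pose = F.interpolate(pose_map, size=size, mode="bilinear",
                             align_corners=False)
        sk = F.interpolate(sketch, size=size, mode="nearest")
        return self.proj(torch.cat([noisy_latent, pose, sk], dim=1))

# x = MultimodalDenoiserInput()(torch.randn(1, 4, 64, 64),
#                               torch.randn(1, 18, 512, 512),
#                               torch.randn(1, 1, 512, 512))
```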
