Eleonora Grassucci

Trimodal Glioma Representation Alignment via Volumetric Contrastive Learning

Jun 12, 2026

Denise Marini, Eleonora Grassucci, Danilo Comminiello

Abstract: Glioma grading and survival prediction require the integration of heterogeneous information collected at different spatial and biological scales. Histopathology describes tissue morphology, mRNA expression captures molecular activity, and magnetic resonance imaging provides a non-invasive view of tumor extent and radiological heterogeneity. Existing glioma prognosis models often combine only two of these sources, while their alignment objectives remain mostly pairwise. This paper introduces GLORIA, a novel trimodal framework for GLioma Omics - Radiology - hIstopathology Alignment. GLORIA processes whole-slide image regions, gene-expression profiles, and 3D MRI volumes through modality-specific encoders, projects them into a shared latent space, and aligns them with a Gramian contrastive loss that measures the volume spanned by the three modality embeddings. The aligned representations are fused through a cross-modal gating module and optimized jointly for three-class glioma grading and overall survival prediction. We evaluate GLORIA on a matched TCGA-GBM/LGG and BraTS21 cohort, comprising 132 patients with all three modalities. On the shared trimodal test set, GLORIA improves over the bimodal WSI-mRNA baseline on all metrics considered.

GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

Jun 04, 2026

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

Abstract: Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity and enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.

Closing the gap in multimodal medical representation alignment

Feb 23, 2026

Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello

Abstract: In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is also present in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.

* Accepted at MLSP 2025

Closing the Modality Gap Aligns Group-Wise Semantics

Jan 26, 2026

Eleonora Grassucci, Giordano Cicchetti, Emanuele Frasca, Aurelio Uncini, Danilo Comminiello

Abstract: In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.

* ICLR 2026

Beyond Answers: How LLMs Can Pursue Strategic Thinking in Education

Apr 07, 2025

Eleonora Grassucci, Gualtiero Grassucci, Aurelio Uncini, Danilo Comminiello

Abstract: Artificial Intelligence (AI) holds transformative potential in education, enabling personalized learning, enhancing inclusivity, and encouraging creativity and curiosity. In this paper, we explore how Large Language Models (LLMs) can act as both patient tutors and collaborative partners to enhance education delivery. As tutors, LLMs personalize learning by offering step-by-step explanations and addressing individual needs, making education more inclusive for students with diverse backgrounds or abilities. As collaborators, they expand students' horizons, supporting them in tackling complex, real-world problems and co-creating innovative projects. However, to fully realize these benefits, LLMs must be leveraged not as tools for providing direct solutions but rather as guides that help students develop problem-solving strategies and find learning paths together. Therefore, a strong emphasis should be placed on educating students and teachers on the successful use of LLMs to ensure their effective integration into classrooms. Through practical examples and real-world case studies, this paper illustrates how LLMs can make education more inclusive and engaging while empowering students to reach their full potential.

Gramian Multimodal Representation Learning and Alignment

Dec 16, 2024

Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello

Abstract: Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the conventional pairwise approach to multimodal learning and we present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holding for 2 to $n$ modalities and providing more meaningful alignment with respect to previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at https://ispamm.github.io/GRAM/.
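
The Gramian volume mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the project page for that): the function name `gram_volume` and the L2 normalization of the embeddings are assumptions made here for illustration. The sketch only uses the standard identity that the volume of the parallelotope spanned by $k$ vectors equals the square root of the determinant of their Gram matrix.

```python
import numpy as np

def gram_volume(vectors: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the rows of `vectors` (shape k x d).

    Hypothetical helper for illustration; rows are L2-normalized first
    (an assumption, so the value lies in [0, 1]).
    """
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = v @ v.T                      # k x k matrix of pairwise inner products
    det = np.linalg.det(gram)           # squared volume of the parallelotope
    return float(np.sqrt(max(det, 0.0)))  # clamp tiny negative round-off

# Three mutually orthogonal "modality" embeddings: maximal volume.
orthogonal = np.eye(3)
# Three nearly identical embeddings: volume close to zero, i.e. well aligned.
aligned = np.array([[1.0, 0.0, 0.0],
                    [1.0, 0.01, 0.0],
                    [1.0, 0.0, 0.01]])

print(gram_volume(orthogonal))
print(gram_volume(aligned))
```

For two unit vectors the volume reduces to the sine of the angle between them, so a small volume corresponds to a high cosine similarity; with more than two vectors the same quantity scores all modalities jointly rather than pair by pair.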

Lightweight Diffusion Models for Resource-Constrained Semantic Communication

Oct 03, 2024

Giovanni Pignata, Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello

Abstract: Recently, generative semantic communication models have proliferated as they are revolutionizing semantic communication frameworks, improving their performance, and opening the way to novel applications. Despite their impressive ability to regenerate content from the compressed semantic information received, generative models pose crucial challenges for communication systems in terms of high memory footprints and heavy computational load. In this paper, we present a novel Quantized GEnerative Semantic COmmunication framework, Q-GESCO. The core method of Q-GESCO is a quantized semantic diffusion model capable of regenerating transmitted images from the received semantic maps while simultaneously reducing computational load and memory footprint thanks to the proposed post-training quantization technique. Q-GESCO is robust to different channel noises and achieves performance comparable to its full-precision counterpart in different scenarios, saving up to 75% of memory and 79% of floating-point operations. This allows resource-constrained devices to exploit the generative capabilities of Q-GESCO, widening the range of applications and systems for generative semantic communication frameworks. The code is available at https://github.com/ispamm/Q-GESCO.
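
To give a feel for how post-training quantization reduces memory, here is a generic sketch of uniform 8-bit quantization of a weight tensor. It is not the Q-GESCO technique itself (the repository linked above is the reference for that); the helper names, the min-max calibration, and the choice of 8 bits are all assumptions made for illustration. Under these assumptions, storing 8-bit codes instead of 32-bit floats cuts the tensor's memory by 1 - 8/32 = 75%.

```python
import numpy as np

NUM_BITS = 8  # assumed bit width, not taken from the paper

def quantize_uniform(w: np.ndarray):
    """Map float weights to uint8 codes with a min-max affine scheme (hypothetical helper)."""
    qmax = 2 ** NUM_BITS - 1
    w64 = w.astype(np.float64)
    w_min, w_max = float(w64.min()), float(w64.max())
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    codes = np.clip(np.round((w64 - w_min) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return codes.astype(np.float64) * scale + w_min

rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)  # stand-in for a trained layer

codes, scale, w_min = quantize_uniform(weights)
restored = dequantize(codes, scale, w_min)

memory_saving = 1.0 - codes.nbytes / weights.nbytes
max_error = float(np.max(np.abs(restored - weights)))
print(f"memory saving: {memory_saving:.0%}, max abs error: {max_error:.4f}")
```

The rounding error per weight is bounded by half the quantization step, which is why a model can stay close to its full-precision behavior when the step is small relative to the weight range.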

Rethinking Multi-User Semantic Communications with Deep Generative Models

May 16, 2024

Eleonora Grassucci, Jinho Choi, Jihong Park, Riccardo F. Gramaccioni, Giordano Cicchetti, Danilo Comminiello

Abstract: In recent years, novel communication strategies have emerged to face the challenges posed by the increased number of connected devices and the higher quality of transmitted information. Among them, semantic communication has obtained promising results, especially when combined with state-of-the-art deep generative models, such as large language or diffusion models, which are able to regenerate content from extremely compressed semantic information. However, most of these approaches focus on single-user scenarios, processing the received content at the receiver on top of conventional communication systems. In this paper, we propose to go beyond these methods by developing a novel generative semantic communication framework tailored for multi-user scenarios. This system assigns the channel to users knowing that the lost information can be filled in with a diffusion model at the receivers. Under this innovative perspective, OFDMA systems should not aim to transmit the largest part of information, but solely the bits necessary for the generative model to semantically regenerate the missing ones. The thorough experimental evaluation shows the capabilities of the novel diffusion model and the effectiveness of the proposed framework, leading towards a GenAI-based next generation of communications.

* Under review in IEEE Journal on Selected Areas in Communications

Language-Oriented Semantic Latent Representation for Image Transmission

May 16, 2024

Giordano Cicchetti, Eleonora Grassucci, Jihong Park, Jinho Choi, Sergio Barbarossa, Danilo Comminiello

Abstract: In the new paradigm of semantic communication (SC), the focus is on delivering meanings behind bits by extracting semantic information from raw data. Recent advances in data-to-text models facilitate language-oriented SC, particularly for text-transformed image communication via image-to-text (I2T) encoding and text-to-image (T2I) decoding. However, although semantically aligned, the text is too coarse to precisely capture sophisticated visual features such as spatial locations, color, and texture, incurring a significant perceptual difference between intended and reconstructed images. To address this limitation, in this paper, we propose a novel language-oriented SC framework that communicates both text and a compressed image embedding and combines them using a latent diffusion model to reconstruct the intended image. Experimental results validate the potential of our approach, which transmits only 2.09% of the original image size while achieving higher perceptual similarities in noisy communication channels compared to a baseline SC method that communicates only through text. The code is available at https://github.com/ispamm/Img2Img-SC/.

* Under review at IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2024

Demystifying the Hypercomplex: Inductive Biases in Hypercomplex Deep Learning

May 11, 2024

Danilo Comminiello, Eleonora Grassucci, Danilo P. Mandic, Aurelio Uncini

Abstract: Hypercomplex algebras have recently been gaining prominence in the field of deep learning owing to the advantages of their division algebras over real vector spaces and their superior results when dealing with multidimensional signals in real-world 3D and 4D paradigms. This paper provides a foundational framework that serves as a roadmap for understanding why hypercomplex deep learning methods are so successful and how their potential can be exploited. Such a theoretical framework is described in terms of inductive bias, i.e., a collection of assumptions, properties, and constraints that are built into training algorithms to guide their learning process toward more efficient and accurate solutions. We show that it is possible to derive specific inductive biases in the hypercomplex domains, which extend complex numbers to encompass diverse numbers and data structures. These biases prove effective in managing the distinctive properties of these domains, as well as the complex structures of multidimensional and multimodal signals. This novel perspective for hypercomplex deep learning promises to both demystify this class of methods and clarify their potential, under a unifying framework, and in this way promotes hypercomplex models as viable alternatives to traditional real-valued deep learning for multidimensional signal processing.

* Accepted for Publication in IEEE Signal Processing Magazine
