Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jürgen Frikel

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

Jun 01, 2026

Catyana Heyne, Jürgen Frikel, Filippo Riccio

Abstract:Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

Via

Access Paper or Ask Questions

Feature reconstruction from incomplete tomographic data without detour

Feb 22, 2022

Simon Göppel, Jürgen Frikel, Markus Haltmeier

Figure 1 for Feature reconstruction from incomplete tomographic data without detour

Figure 2 for Feature reconstruction from incomplete tomographic data without detour

Figure 3 for Feature reconstruction from incomplete tomographic data without detour

Figure 4 for Feature reconstruction from incomplete tomographic data without detour

Abstract:In this paper, we consider the problem of feature reconstruction from incomplete x-ray CT data. Such problems occurs, e.g., as a result of dose reduction in the context medical imaging. Since image reconstruction from incomplete data is a severely ill-posed problem, the reconstructed images may suffer from characteristic artefacts or missing features, and significantly complicate subsequent image processing tasks (e.g., edge detection or segmentation). In this paper, we introduce a novel framework for the robust reconstruction of convolutional image features directly from CT data, without the need of computing a reconstruction firs. Within our framework we use non-linear (variational) regularization methods that can be adapted to a variety of feature reconstruction tasks and to several limited data situations . In our numerical experiments, we consider several instances of edge reconstructions from angularly undersampled data and show that our approach is able to reliably reconstruct feature maps in this case.

Via

Access Paper or Ask Questions