Steffen Vogler

Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

May 14, 2026

Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare(+10 more)

Abstract:Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.

* 23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker
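To make the idea of generating Croissant metadata directly from a local directory more concrete, here is a minimal sketch in Python. It is not Croissant Baker's implementation or its handler registry: the function name `sketch_croissant`, the CSV-only scan, the text-only field typing, and the abbreviated `@context` are all assumptions made for illustration, and the output is not guaranteed to pass a Croissant validator.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def sketch_croissant(dataset_dir: Path, name: str) -> dict:
    """Build a simplified Croissant-style JSON-LD dict for the CSV files in a directory.

    Hypothetical helper for illustration only; not part of Croissant Baker.
    """
    distribution, record_sets = [], []
    for path in sorted(dataset_dir.rglob("*.csv")):
        rel = path.relative_to(dataset_dir).as_posix()
        # One FileObject per file, with a checksum so the file can be verified later.
        distribution.append({
            "@type": "cr:FileObject",
            "@id": rel,
            "name": rel,
            "contentUrl": rel,
            "encodingFormat": "text/csv",
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
        # Only the header row is read; the data rows are never loaded.
        with path.open(newline="") as handle:
            header = next(csv.reader(handle), [])
        record_sets.append({
            "@type": "cr:RecordSet",
            "@id": f"{rel}/records",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": f"{rel}/records/{column}",
                    "name": column,
                    # Assumption: every column is typed as text (no type inference).
                    "dataType": "sc:Text",
                    "source": {"fileObject": {"@id": rel}, "extract": {"column": column}},
                }
                for column in header
            ],
        })
    return {
        # Abbreviated context (assumption); the full Croissant context has more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "distribution": distribution,
        "recordSet": record_sets,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "labs.csv").write_text("subject_id,value\n1,3.5\n")
        print(json.dumps(sketch_croissant(root, "demo-dataset"), indent=2))
```

Because the sketch reads only file bytes and header rows from the local directory, nothing has to be uploaded to a public platform, which is the property the abstract emphasizes for governed data.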

Content-based 3D Image Retrieval and a ColBERT-inspired Re-ranking for Tumor Flagging and Staging

Jul 23, 2025

Farnaz Khun Jush, Steffen Vogler, Matthias Lenga

Abstract:The increasing volume of medical images poses challenges for radiologists in retrieving relevant cases. Content-based image retrieval (CBIR) systems offer potential for efficient access to similar cases, yet lack standardized evaluation and comprehensive studies. Building on prior studies for tumor characterization via CBIR, this study advances CBIR research for volumetric medical images through three key contributions: (1) a framework eliminating reliance on pre-segmented data and organ-specific datasets, aligning with large and unstructured image archiving systems, i.e., PACS in clinical practice; (2) introduction of C-MIR, a novel volumetric re-ranking method adapting ColBERT's contextualized late interaction mechanism for 3D medical imaging; (3) comprehensive evaluation across four tumor sites using three feature extractors and three database configurations. Our evaluations highlight the significant advantages of C-MIR. We demonstrate the successful adaptation of the late interaction principle to volumetric medical images, enabling effective context-aware re-ranking. A key finding is C-MIR's ability to effectively localize the region of interest, eliminating the need for pre-segmentation of datasets and offering a computationally efficient alternative to systems relying on expensive data enrichment steps. C-MIR demonstrates promising improvements in tumor flagging, particularly for colon and lung tumors (p<0.05). C-MIR also shows potential for improving tumor staging, warranting further exploration of its capabilities. Ultimately, our work seeks to bridge the gap between advanced retrieval techniques and their practical applications in healthcare, paving the way for improved diagnostic processes.
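The late interaction principle mentioned in the abstract can be sketched as follows. This is a generic ColBERT-style MaxSim scorer, not the C-MIR implementation: treating each volume as a list of per-slice embedding vectors, using cosine similarity, and the function names `late_interaction_score` and `rerank` are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def late_interaction_score(query_slices, candidate_slices):
    """MaxSim: for each query embedding take its best match in the candidate, then sum.

    Assumes both lists are non-empty.
    """
    return sum(max(cosine(q, c) for c in candidate_slices) for q in query_slices)


def rerank(query_slices, candidates):
    """Return candidate ids ordered by descending late-interaction score."""
    scores = {
        cid: late_interaction_score(query_slices, slices)
        for cid, slices in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    shortlist = {
        "volume_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query slices
        "volume_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query slice
    }
    print(rerank(query, shortlist))
```

Because each query embedding is matched independently against every candidate embedding, a candidate can score well when only part of it resembles the query, which is one plausible reading of how such a scorer can pick out a region of interest without pre-segmentation.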

DiNO-Diffusion. Scaling Medical Diffusion via Self-Supervised Pre-Training

Jul 16, 2024

Guillermo Jimenez-Perez, Pedro Osorio, Josef Cersovsky, Javier Montalt-Tordera, Jens Hooge, Steffen Vogler, Sadegh Mohammadi

Abstract:Diffusion models (DMs) have emerged as powerful foundation models for a variety of tasks, with a strong focus on synthetic image generation. However, their requirement of large annotated datasets for training limits their applicability in medical imaging, where datasets are typically smaller and sparsely annotated. We introduce DiNO-Diffusion, a self-supervised method for training latent diffusion models (LDMs) that conditions the generation process on image embeddings extracted from DiNO. By eliminating the reliance on annotations, our training leverages over 868k unlabelled images from public chest X-Ray (CXR) datasets. Despite being self-supervised, DiNO-Diffusion shows comprehensive manifold coverage, with FID scores as low as 4.7, and emerging properties when evaluated in downstream tasks. It can be used to generate semantically-diverse synthetic datasets even from small data pools, demonstrating up to 20% AUC increase in classification performance when used for data augmentation. Images were generated with different sampling strategies over the DiNO embedding manifold and using real images as a starting point. Results suggest that DiNO-Diffusion could facilitate the creation of large datasets for flexible training of downstream AI models from a limited amount of real data, while also holding potential for privacy preservation. Additionally, DiNO-Diffusion demonstrates zero-shot segmentation performance of up to 84.4% Dice score when evaluating lung lobe segmentation. This evidences good CXR image-anatomy alignment, akin to segmenting using textual descriptors on vanilla DMs. Finally, DiNO-Diffusion can be easily adapted to other medical imaging modalities or state-of-the-art diffusion models, opening the door for large-scale, multi-domain image generation pipelines for medical imaging.

* 12 pages, 5 figures
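For readers unfamiliar with the Dice score quoted in the abstract, the standard definition is 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it for flattened binary masks; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, and this is not the paper's evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```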

Content-Based Image Retrieval for Multi-Class Volumetric Radiology Images: A Benchmark Study

May 15, 2024

Farnaz Khun Jush, Steffen Vogler, Tuan Truong, Matthias Lenga

Abstract:While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from pre-trained supervised models on medical images against embeddings derived from pre-trained unsupervised models on non-medical images for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions with a wide size range. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.

* 23 pages, 9 Figures, 13 Tables

Croissant: A Metadata Format for ML-Ready Datasets

Mar 28, 2024

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson(+9 more)

Abstract:Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

* Preprint. Contributors listed in alphabetical order

Medical Image Retrieval Using Pretrained Embeddings

Nov 22, 2023

Farnaz Khun Jush, Tuan Truong, Steffen Vogler, Matthias Lenga

Abstract:A wide range of imaging techniques and data formats available for medical images make accurate retrieval from image databases challenging. Efficient retrieval systems are crucial in advancing medical research, enabling large-scale studies and innovative diagnostic tools. Thus, addressing the challenges of medical image retrieval is essential for the continued enhancement of healthcare and research. In this study, we evaluated the feasibility of employing four state-of-the-art pretrained models for medical image retrieval at modality, body region, and organ levels and compared the results of two similarity indexing approaches. Since the employed networks take 2D images, we analyzed the impacts of weighting and sampling strategies to incorporate 3D information during retrieval of 3D volumes. We showed that medical image retrieval is feasible using pretrained networks without any additional training or fine-tuning steps. Using pretrained embeddings, we achieved a recall of 1 for various tasks at modality, body region, and organ level.

* 8 pages, 3 figures, 4 tables

Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling

Apr 09, 2021

Jonas Dippel, Steffen Vogler, Johannes Höhne

Abstract:This paper presents Contrastive Reconstruction (ConRec), a self-supervised learning algorithm that obtains image representations by jointly optimizing a contrastive and a self-reconstruction loss. We show that state-of-the-art contrastive learning methods (e.g. SimCLR) fall short in capturing fine-grained visual features in their representations. ConRec extends the SimCLR framework by adding (1) a self-reconstruction task and (2) an attention mechanism within the contrastive learning task. This is accomplished by applying a simple encoder-decoder architecture with two heads. We show that both extensions contribute towards an improved vector representation for images with fine-grained visual features. Combining those concepts, ConRec outperforms SimCLR and SimCLR with Attention-Pooling on fine-grained classification datasets.
