Abstract:Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.
Abstract:Convolutional neural networks (CNNs) learn abstract features to perform object classification, but understanding these features remains challenging due to difficult-to-interpret results or high computational costs. We propose an automatic method to visualize and systematically analyze learned features in CNNs. Specifically, we introduce a linking network that maps the penultimate layer of a pre-trained classifier to the latent space of a generative model (StyleGAN-XL), thereby enabling an interpretable, human-friendly visualization of the classifier's representations. Our findings indicate a congruent semantic order in both spaces, enabling a direct linear mapping between them. Training the linking network is computationally inexpensive and decoupled from training both the GAN and the classifier. We introduce an automatic pipeline that utilizes such GAN-based visualizations to quantify learned representations by analyzing activation changes in the classifier in the image domain. This quantification allows us to systematically study the learned representations in several thousand units simultaneously and to extract and visualize units selective for specific semantic concepts. Further, we illustrate how our method can be used to quantify and interpret the classifier's decision boundary using counterfactual examples. Overall, our method offers systematic and objective perspectives on learned abstract representations in CNNs. https://github.com/kaschube-lab/LinkingInStyle.git




Abstract:Advances in microscopy imaging enable researchers to visualize structures at the nanoscale level thereby unraveling intricate details of biological organization. However, challenges such as image noise, photobleaching of fluorophores, and low tolerability of biological samples to high light doses remain, restricting temporal resolutions and experiment durations. Reduced laser doses enable longer measurements at the cost of lower resolution and increased noise, which hinders accurate downstream analyses. Here we train a denoising diffusion probabilistic model (DDPM) to predict high-resolution images by conditioning the model on low-resolution information. Additionally, the probabilistic aspect of the DDPM allows for repeated generation of images that tend to further increase the signal-to-noise ratio. We show that our model achieves a performance that is better or similar to the previously best-performing methods, across four highly diverse datasets. Importantly, while any of the previous methods show competitive performance for some, but not all datasets, our method consistently achieves high performance across all four data sets, suggesting high generalizability.