Matthias Bethge

Unsupervised Object Learning via Common Fate

Oct 13, 2021

Matthias Tangemann, Steffen Schneider, Julius von Kügelgen, Francesco Locatello, Peter Gehler, Thomas Brox, Matthias Kümmerer, Matthias Bethge, Bernhard Schölkopf

Abstract: Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling. We decompose this problem into three easier subtasks and provide candidate solutions for each of them. Inspired by the Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks of moving objects via unsupervised motion segmentation. Second, generative models are trained on the masks of the background and the moving objects, respectively. Third, background and foreground models are combined in a conditional "dead leaves" scene model to sample novel scene configurations where occlusions and depth layering arise naturally. To evaluate the individual stages, we introduce the Fishbowl dataset, positioned between complex real-world scenes and common object-centric benchmarks of simplistic objects. We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos, and represent scenes in a modular fashion that allows sampling plausible scenes outside the training distribution by permitting, for instance, object numbers or densities not observed in the training set.
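
The "dead leaves" idea in the abstract above (objects layered over a background so that occlusion follows from depth order) can be illustrated with a toy compositing routine. This is only a sketch under stated assumptions, not the authors' scene model: the function name `compose_dead_leaves`, the tuple layout, and the integer "pixels" are all hypothetical, and the real model samples backgrounds and objects from learned generative models rather than taking them as fixed inputs.

```python
def compose_dead_leaves(background, objects):
    """Paint objects back to front onto a copy of the background.

    background: 2D list of pixel values.
    objects: list of (mask, appearance, top, left) tuples ordered back to
             front; mask is a 2D list of 0/1 and appearance is a 2D list of
             pixel values with the same shape as the mask.
    (Hypothetical interface, chosen only for this illustration.)
    """
    canvas = [row[:] for row in background]  # leave the input untouched
    height, width = len(canvas), len(canvas[0])
    for mask, appearance, top, left in objects:
        for i, mask_row in enumerate(mask):
            for j, inside in enumerate(mask_row):
                y, x = top + i, left + j
                # Later (nearer) objects overwrite earlier ones: occlusion.
                if inside and 0 <= y < height and 0 <= x < width:
                    canvas[y][x] = appearance[i][j]
    return canvas


background = [[0] * 4 for _ in range(3)]
square = [[1, 1], [1, 1]]
far = (square, [[1, 1], [1, 1]], 0, 0)   # drawn first, so it sits further back
near = (square, [[2, 2], [2, 2]], 1, 1)  # drawn last, so it occludes `far`
scene = compose_dead_leaves(background, [far, near])
```

Because the depth order is just the order of the list, the same routine can place more objects than any training scene contained, which is the kind of modularity the abstract describes.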

Visual Representation Learning Does Not Generalize Strongly Within the Same Domain

Jul 23, 2021

Lukas Schott, Julius von Kügelgen, Frederik Träuble, Peter Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, Wieland Brendel

Abstract: An important component for generalization in machine learning is to uncover underlying latent factors of variation as well as the mechanism through which each factor acts in the world. In this paper, we test whether 17 unsupervised, weakly supervised, and fully supervised representation learning approaches correctly infer the generative factors of variation in simple datasets (dSprites, Shapes3D, MPI3D). In contrast to prior robustness work that introduces novel factors of variation during test time, such as blur or other (un)structured noise, we here recompose, interpolate, or extrapolate only existing factors of variation from the training data set (e.g., small and medium-sized objects during training and large objects during testing). Models that learn the correct mechanism should be able to generalize to this benchmark. In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. Moreover, the generalization capabilities of all tested models drop significantly as we move from artificial datasets towards more realistic real-world datasets. Despite their inability to identify the correct mechanism, the models are quite modular, as their ability to infer other in-distribution factors remains fairly stable, provided only a single factor is out-of-distribution. These results point to an important yet understudied problem of learning mechanistic models of observations that can facilitate generalization.

How Well do Feature Visualizations Support Causal Understanding of CNN Activations?

Jun 23, 2021

Roland S. Zimmermann, Judy Borowski, Robert Geirhos, Matthias Bethge, Thomas S. A. Wallis, Wieland Brendel

Abstract: One widely used approach towards understanding the inner workings of deep convolutional neural networks is to visualize unit responses via activation maximization. Feature visualizations via activation maximization are thought to provide humans with precise information about the image features that cause a unit to be activated. If this is indeed true, these synthetic images should enable humans to predict the effect of an intervention, such as whether occluding a certain patch of the image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to predict which of two square occlusions causes a larger change to a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that on average, the extremely activating feature visualizations by Olah et al. (2017) indeed help humans on this task ($67 \pm 4\%$ accuracy; baseline performance without any visualizations is $60 \pm 3\%$). However, they do not provide any significant advantage over other visualizations (such as dataset samples), which yield similar performance ($66 \pm 3\%$ to $67 \pm 3\%$ accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that feature visualizations provide humans with better "causal understanding" than simple alternative visualizations.

* ICML 2021 XAI workshop version. Joint first and last authors. Project website at https://brendel-group.github.io/causal-understanding-via-visualizations/

Partial success in closing the gap between human and machine vision

Jun 14, 2021

Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel

Abstract: A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the "missing human baseline" by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/

* A preliminary version of this work was presented as Oral at the 2020 NeurIPS workshop on "Shared Visual Representations in Human & Machine Intelligence" (arXiv:2010.08377)
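
The abstract above does not say how image-level consistency is computed. One common way to quantify whether two observers make errors on the same images is Cohen's kappa over per-image correct/incorrect decisions, which discounts the agreement expected from accuracy alone. The sketch below assumes that choice; the function name `error_consistency` and the toy response vectors are illustrative, not taken from the benchmark code.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa between two per-image correctness vectors.

    correct_a, correct_b: equal-length sequences of 0/1 (or False/True),
    one entry per image, marking whether each observer classified it
    correctly. Assumed metric; the abstract does not name one.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("need two non-empty vectors of equal length")
    # Fraction of images on which both are right or both are wrong.
    observed = sum(bool(a) == bool(b) for a, b in zip(correct_a, correct_b)) / n
    acc_a = sum(bool(a) for a in correct_a) / n
    acc_b = sum(bool(b) for b in correct_b) / n
    # Agreement expected if errors were independent given the accuracies.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


human = [1, 1, 0, 0]
same_errors = [1, 1, 0, 0]      # identical error pattern
unrelated_errors = [1, 0, 1, 0]  # same accuracy, chance-level overlap
kappa_same = error_consistency(human, same_errors)
kappa_unrelated = error_consistency(human, unrelated_errors)
```

A value near 1 means the two observers fail on the same images; a value near 0 means their errors overlap no more than their accuracies alone would predict.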

Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling

May 27, 2021

Akis Linardos, Matthias Kümmerer, Ori Press, Matthias Bethge

Abstract: Since 2014, transfer learning has become the key driver for the improvement of spatial saliency prediction; however, progress has stagnated in the last 3-5 years. We conduct a large-scale transfer learning study which tests different ImageNet backbones, always using the same readout architecture and learning protocol adopted from DeepGaze II. By replacing the VGG19 backbone of DeepGaze II with ResNet50 features we improve the performance on saliency prediction from 78% to 85%. However, as we continue to test better ImageNet models as backbones (such as EfficientNetB5) we observe no additional improvement on saliency prediction. By analyzing the backbones further, we find that generalization to other datasets differs substantially, with models being consistently overconfident in their fixation predictions. We show that by combining multiple backbones in a principled manner a good confidence calibration on unseen datasets can be achieved. This yields a significant leap in benchmark performance in and out-of-domain with a 15 percentage point improvement over DeepGaze II to 93% on MIT1003, marking a new state of the art on the MIT/Tuebingen Saliency Benchmark in all available metrics (AUC: 88.3%, sAUC: 79.4%, CC: 82.4%).

* Joint first authors

Adapting ImageNet-scale models to complex distribution shifts with self-learning

Apr 28, 2021

Evgenia Rusak, Steffen Schneider, Peter Gehler, Oliver Bringmann, Wieland Brendel, Matthias Bethge

Abstract: While self-learning methods are an important component in many recent domain adaptation techniques, they are not yet comprehensively evaluated on ImageNet-scale datasets common in robustness research. In extensive experiments on ResNet and EfficientNet models, we find that three components are crucial for increasing performance with self-learning: (i) using short update times between the teacher and the student network, (ii) fine-tuning only a few affine parameters distributed across the network, and (iii) leveraging methods from robust classification to counteract the effect of label noise. We use these insights to obtain drastically improved state-of-the-art results on ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error). Our techniques yield further improvements in combination with previously proposed robustification methods. Self-learning is able to reduce the top-1 error to a point where no substantial further progress can be expected. We therefore re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 and use a subset of it as a new robustness benchmark (ImageNet-D) which proves to be a more challenging dataset for all current state-of-the-art models (58.2% error) to guide future research efforts at the intersection of robustness and domain adaptation on ImageNet scale.

* Web: https://domainadaptation.org/selflearning

State-of-the-Art in Human Scanpath Prediction

Feb 24, 2021

Matthias Kümmerer, Matthias Bethge

Abstract: Recent years have seen a surge in models predicting the scanpaths of fixations made by humans when viewing images. However, the field is lacking a principled comparison of those models with respect to their predictive power. In the past, models have usually been evaluated based on comparing human scanpaths to scanpaths generated from the model. Here, instead we evaluate models based on how well they predict each fixation in a scanpath given the previous scanpath history. This makes model evaluation closely aligned with the biological processes thought to underlie scanpath generation and allows applying established saliency metrics like AUC and NSS in an intuitive and interpretable way. We evaluate many existing models of scanpath prediction on the datasets MIT1003, MIT300, CAT2000 train and CAT2000 test, for the first time giving a detailed picture of the current state of the art of human scanpath prediction. We also show that the discussed method of model benchmarking allows for more detailed analyses leading to interesting insights about where and when models fail to predict human behaviour. The MIT/Tuebingen Saliency Benchmark will implement the evaluation of scanpath models as detailed here, allowing researchers to score their models on the established benchmark datasets MIT300 and CAT2000.
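
The evaluation scheme described above (score each fixation against a prediction conditioned on the preceding scanpath) can be sketched with NSS, the normalized scanpath saliency, which z-scores a prediction map and reads off its value at the observed fixation. Everything below is a minimal illustration under assumptions: `predict` is a hypothetical callable standing in for a scanpath model, maps are plain nested lists, the population standard deviation is used, and the first fixation is scored like any other, which a real benchmark may handle differently.

```python
from statistics import mean, pstdev


def nss(prediction_map, fixation):
    """Z-scored value of a 2D prediction map at one (row, col) fixation."""
    values = [v for row in prediction_map for v in row]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("NSS is undefined for a constant map")
    y, x = fixation
    return (prediction_map[y][x] - mu) / sigma


def score_scanpath(predict, scanpath):
    """Average NSS over a scanpath, conditioning each prediction on history.

    predict: hypothetical callable mapping the list of previous fixations
             to a 2D prediction map for the next fixation.
    scanpath: list of (row, col) fixations in viewing order.
    """
    scores = []
    for i, fixation in enumerate(scanpath):
        history = scanpath[:i]  # only fixations made before this one
        scores.append(nss(predict(history), fixation))
    return mean(scores)


def toy_model(history):
    # Ignores the history and always favours the bottom-right pixel.
    return [[0.0, 0.0], [0.0, 4.0]]


average_nss = score_scanpath(toy_model, [(1, 1), (0, 0)])
```

Because each fixation gets its own score, results can be broken down by fixation index or image region, which is what makes the per-fixation analyses mentioned in the abstract possible.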

Contrastive Learning Inverts the Data Generating Process

Feb 17, 2021

Roland S. Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, Wieland Brendel

Abstract: Contrastive learning has recently seen tremendous success in self-supervised learning. So far, however, it is largely unclear why the learned representations generalize so effectively to a large variety of downstream tasks. Here we prove that feedforward models trained with objectives belonging to the commonly used InfoNCE family learn to implicitly invert the underlying generative model of the observed data. While the proofs make certain statistical assumptions about the generative model, we observe empirically that our findings hold even if these assumptions are severely violated. Our theory highlights a fundamental connection between contrastive learning, generative modeling, and nonlinear independent component analysis, thereby furthering our understanding of the learned representations as well as providing a theoretical foundation to derive more effective contrastive losses.

* The first three authors, as well as the last two authors, contributed equally. Code is available at https://brendel-group.github.io/cl-ica
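
For readers unfamiliar with the InfoNCE family mentioned in the abstract, a generic single-anchor form of the loss is sketched below: the anchor's similarity to its positive is contrasted against its similarities to a set of negatives through a softmax. This is a textbook-style illustration, not the paper's code; the dot-product similarity, the `temperature` default, and the function names are assumptions made here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=1.0):
    """Generic InfoNCE loss for one anchor (illustrative sketch).

    Returns -log( exp(s_pos) / (exp(s_pos) + sum_i exp(s_neg_i)) ),
    where s = dot(anchor, other) / temperature.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, negative) / temperature for negative in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_normalizer = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_normalizer - logits[0]


anchor = [1.0, 0.0]
aligned_loss = info_nce(anchor, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
misaligned_loss = info_nce(anchor, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

The loss is lower when the positive is more similar to the anchor than the negatives are, so minimizing it pulls positive pairs together relative to the rest of the batch.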

Closing the Generalization Gap in One-Shot Object Detection

Nov 09, 2020

Claudio Michaelis, Matthias Bethge, Alexander S. Ecker

Abstract: Despite substantial progress in object detection and few-shot learning, detecting objects based on a single example (one-shot object detection) remains a challenge: trained models exhibit a substantial generalization gap, where object categories used during training are detected much more reliably than novel ones. Here we show that this generalization gap can be nearly closed by increasing the number of object categories used during training. Our results show that the models switch from memorizing individual categories to learning object similarity over the category distribution, enabling strong generalization at test time. Importantly, in this regime standard methods to improve object detection models like stronger backbones or longer training schedules also benefit novel categories, which was not the case for smaller datasets like COCO. Our results suggest that the key to strong few-shot detection models may not lie in sophisticated metric learning approaches, but instead in scaling the number of categories. Future data annotation efforts should therefore focus on wider datasets and annotate a larger number of categories rather than gathering more images or instances per category.

Exemplary Natural Images Explain CNN Activations Better than Feature Visualizations

Oct 23, 2020

Judy Borowski, Roland S. Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, Wieland Brendel

Abstract: Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs' inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images (Olah et al., 2017) with a simple baseline visualization, namely exemplary natural images that also strongly activate a specific feature map. Given either synthetic or natural reference images, human participants choose which of two query images leads to strong positive activation. The experiment is designed to maximize participants' performance, and is the first to probe intermediate instead of final layer representations. We find that synthetic images indeed provide helpful information about feature map activations (82% accuracy; chance would be 50%). However, natural images (originally intended to be a baseline) outperform synthetic images by a wide margin (92% accuracy). Additionally, participants are faster and more confident for natural images, whereas subjective impressions about the interpretability of feature visualization are mixed. The higher informativeness of natural images holds across most layers, for both expert and lay participants as well as for hand- and randomly-picked feature visualizations. Even if only a single reference image is given, synthetic images provide less information than natural images (65% vs. 73%). In summary, popular synthetic images from feature visualizations are significantly less informative for assessing CNN activations than natural images. We argue that future visualization methods should improve over this simple baseline.
