Learning the similarity between images constitutes the foundation for numerous vision tasks. The common paradigm is discriminative metric learning, which seeks an embedding that separates different training classes. However, the main challenge is to learn a metric that not only generalizes from training to novel, but related, test samples, but also transfers to different object classes. So what complementary information is missed by the discriminative paradigm? Besides finding characteristics that separate classes, we also need them to be likely to occur in novel categories, which is indicated if they are shared across training classes. This work investigates how to learn such characteristics without the need for extra annotations or training data. Because our approach is formulated as a novel triplet sampling strategy, it can be easily applied on top of recent ranking loss frameworks. Experiments show that, independent of the underlying network architecture and the specific ranking loss, our approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets.
Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm is fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We therefore employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach results competitive with state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures. Code can be found under https://github.com/Confusezius/CVPR2020_PADS.
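As a rough illustration of what an adjustable sampling distribution over negatives can look like, the sketch below pairs the standard triplet margin loss with a sampler that first draws a distance bin and then a negative inside it. The binning by anchor-negative distance, the function names, and all parameter values are assumptions made for this example, not details taken from the abstract; the teacher network that would update `bin_probs` during training is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def sample_negative(anchor, negatives, bin_edges, bin_probs, rng):
    """Draw one negative according to a distribution over distance bins.

    bin_edges has len(bin_probs) + 1 entries; bin k covers
    [bin_edges[k], bin_edges[k + 1]). Negatives outside all bins are ignored.
    (Hypothetical parameterization, chosen only for illustration.)
    """
    bins = [[] for _ in bin_probs]
    for neg in negatives:
        d = euclidean(anchor, neg)
        for k in range(len(bin_probs)):
            if bin_edges[k] <= d < bin_edges[k + 1]:
                bins[k].append(neg)
                break
    # Keep only bins that are non-empty and have non-zero probability.
    candidates = [(pool, p) for pool, p in zip(bins, bin_probs) if pool and p > 0]
    if not candidates:
        # Fall back to uniform sampling if no bin is usable.
        return rng.choice(negatives)
    pools, weights = zip(*candidates)
    pool = rng.choices(pools, weights=weights, k=1)[0]
    return rng.choice(pool)


rng = random.Random(0)
anchor = (0.0, 0.0)
negatives = [(0.1, 0.0), (1.0, 0.0), (3.0, 0.0)]
edges = [0.0, 0.5, 2.0, 4.0]
# Putting all probability mass on the first bin favors the closest (hardest) negative.
hard = sample_negative(anchor, negatives, edges, [1.0, 0.0, 0.0], rng)
# Putting it on the last bin favors the farthest (easiest) negative.
easy = sample_negative(anchor, negatives, edges, [0.0, 0.0, 1.0], rng)
```

In an adaptive scheme, `bin_probs` would change over the course of training; a fixed strategy corresponds to leaving it constant.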
Deep Metric Learning (DML) is arguably one of the most influential lines of research for learning visual similarities, with many approaches proposed every year. Although the field benefits from the rapid progress, the divergence in training protocols, architectures, and parameter choices makes an unbiased comparison difficult. To provide a consistent reference point, we revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices as well as the commonly neglected mini-batch sampling process. Based on our analysis, we uncover a correlation between the embedding space compression and the generalization performance of DML models. Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets.
Metric learning seeks to embed images of objects such that class-defined relations are captured by the embedding space. However, variability in images is not just due to different depicted object classes, but also depends on other latent characteristics such as viewpoint or illumination. In addition to these structured properties, random noise further obstructs the visual relations of interest. The common approach to metric learning is to enforce a representation that is invariant under all factors but the ones of interest. In contrast, we propose to explicitly learn the latent characteristics that are shared across object classes. We can then directly explain away structured visual variability, rather than assuming it to be unknown random noise. We propose a novel surrogate task to learn visual characteristics shared across classes with a separate encoder. This encoder is trained jointly with the encoder for class information by reducing their mutual information. On five standard image retrieval benchmarks, the approach significantly improves upon the state-of-the-art.
In this paper, we propose a novel procedure to improve liver and liver lesion segmentation from CT scans for U-Net based models. Our method is an extension to standard segmentation pipelines that allows for more fine-grained control over the network output by focusing on higher target recall or on reducing noisy false-positive predictions, thereby also boosting overall segmentation performance. To achieve this, we include segmentation errors after a primary learning step in a new learning process appended to the main training setup, allowing the model to find features which explain away previous errors. We evaluate this on distinct architectures, including cascaded two- and three-dimensional as well as combined learning setups for multitask segmentation. Using liver and lesion segmentation data provided by the Liver Tumor Segmentation Challenge (LiTS), we observe an increase in Dice score of up to 3 points.
At present, lesion segmentation is still performed manually (or semi-automatically) by medical experts. To facilitate this process, we contribute a fully automatic lesion segmentation pipeline. This work proposes a method as part of the LiTS (Liver Tumor Segmentation Challenge) competition for ISBI 17 and MICCAI 17, which compares methods for automatic segmentation of liver lesions in CT scans. By utilizing cascaded, densely connected 2D U-Nets and a Tversky-coefficient based loss function, our framework achieves very good shape extractions with high detection sensitivity, with competitive scores at time of publication. In addition, adjusting the hyperparameters of our Tversky loss allows tuning the network towards higher sensitivity or robustness.
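To make the role of the Tversky hyperparameters concrete, the following is a minimal sketch of a Tversky-index loss in its commonly used form, TP / (TP + alpha * FP + beta * FN), written in plain Python over flattened masks. The abstract does not give the exact formulation used, so the weighting convention (`alpha` on false positives, `beta` on false negatives), the smoothing term `eps`, and the function name are assumptions of this example.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Return 1 - Tversky index for flattened masks.

    pred:   predicted foreground probabilities in [0, 1]
    target: binary ground-truth labels (0 or 1)
    alpha:  weight on false positives (assumed convention)
    beta:   weight on false negatives (assumed convention)
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# One lesion voxel is missed: a larger beta penalizes the miss more strongly,
# which pushes training towards higher sensitivity.
pred = [1.0, 0.0, 0.0, 0.0]
target = [1, 1, 0, 0]
loss_sensitive = tversky_loss(pred, target, alpha=0.3, beta=0.7)
loss_lenient = tversky_loss(pred, target, alpha=0.7, beta=0.3)
```

With `alpha = beta = 0.5` the index reduces to the Dice coefficient, 2 TP / (2 TP + FP + FN), so the loss generalizes the usual Dice loss; shifting weight from `alpha` to `beta` trades fewer missed lesions for more false positives, and vice versa.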