Convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable success in various vision tasks. However, many architectures do not consider interactions between feature maps from different stages and scales, which may limit their performance. In this work, we propose a simple add-on attention module to overcome these limitations via multi-stage and cross-scale interactions. Specifically, the proposed Multi-Stage Cross-Scale Attention (MSCSA) module takes feature maps from different stages to enable multi-stage interactions and achieves cross-scale interactions by computing self-attention at different scales based on the multi-stage feature maps. Our experiments on several downstream tasks show that MSCSA provides a significant performance boost with modest additional FLOPs and runtime.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Recent advances in deep learning have improved the segmentation accuracy of subcortical brain structures, which would be useful in neuroimaging studies of many neurological disorders. However, most of the previous deep learning work does not investigate the specific difficulties that exist in segmenting extremely small but important brain regions such as the amygdala and its subregions. To tackle this challenging task, a novel 3D Bayesian fully convolutional neural network was developed to apply a dilated dualpathway approach that retains fine details and utilizes both local and more global contextual information to automatically segment the amygdala and its subregions at high precision. The proposed method provides insights on network design and sampling strategy that target segmentations of small 3D structures. In particular, this study confirms that a large context, enabled by a large field of view, is beneficial for segmenting small objects; furthermore, precise contextual information enabled by dilated convolutions allows for better boundary localization, which is critical for examining the morphology of the structure. In addition, it is demonstrated that the uncertainty information estimated from our network may be leveraged to identify atypicality in data. Our method was compared with two state-of-the-art deep learning models and a traditional multi-atlas approach, and exhibited excellent performance as measured both by Dice overlap as well as average symmetric surface distance. To the best of our knowledge, this work is the first deep learning-based approach that targets the subregions of the amygdala.