Camouflaged object detection is a challenging task that aims to identify objects whose texture is similar to their surroundings. This paper proposes to amplify the subtle texture difference between camouflaged objects and the background by formulating multiple texture-aware refinement modules that learn texture-aware features in a deep convolutional neural network. Each texture-aware refinement module computes the covariance matrices of feature responses to extract texture information, uses an affinity loss to learn a set of parameter maps that help separate the texture of camouflaged objects from that of the background, and adopts a boundary-consistency loss to explore object detail structures. We evaluate our network on the benchmark dataset for camouflaged object detection both qualitatively and quantitatively. Experimental results show that our approach outperforms various state-of-the-art methods by a large margin.
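As a rough illustration of how covariance matrices of feature responses can capture texture, the following minimal sketch computes a channel-by-channel covariance over the spatial positions of one feature map. The function name `channel_covariance`, the toy values, and the choice of a global (rather than local-window) covariance normalized by N are assumptions made for illustration; the abstract does not specify these details.

```python
def channel_covariance(features):
    """Return the C x C covariance matrix of C feature channels.

    `features` is a list of C channels; each channel is a flat list of the
    N responses at the N spatial positions of one feature map.
    """
    n = len(features[0])
    # Center each channel by subtracting its mean response.
    means = [sum(channel) / n for channel in features]
    centered = [[v - m for v in channel] for channel, m in zip(features, means)]
    # Entry (i, j) is the average product of the centered responses of
    # channels i and j (population covariance, an assumed normalization).
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
        for ci in centered
    ]


# Toy feature map: 3 channels, 4 spatial positions (hypothetical values).
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],
]
cov = channel_covariance(features)
```

Here channels 0 and 1 co-vary strongly while channel 2 is nearly independent of both, which is the kind of second-order statistic a texture descriptor can build on.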
Annotation scarcity is a long-standing problem in medical image analysis. To efficiently leverage limited annotations, semi-supervised learning additionally exploits abundant unlabeled data, while domain adaptation investigates well-established cross-modality data. In this paper, we explore the feasibility of concurrently leveraging both unlabeled data and cross-modality data for annotation-efficient cardiac segmentation. To this end, we propose a cutting-edge semi-supervised domain adaptation framework, namely Dual-Teacher++. Besides directly learning from limited labeled target domain data (e.g., CT) via a student model, as in previous literature, we design novel dual teacher models: an inter-domain teacher model that explores cross-modality priors from the source domain (e.g., MR) and an intra-domain teacher model that investigates the knowledge beneath the unlabeled target domain. In this way, the dual teacher models transfer the acquired inter- and intra-domain knowledge to the student model for further integration and exploitation. Moreover, to encourage reliable dual-domain knowledge transfer, we enhance inter-domain knowledge transfer on the samples with higher similarity to the target domain after appearance alignment, and also strengthen intra-domain knowledge transfer on unlabeled target data with higher prediction confidence. As a result, the student model can obtain reliable dual-domain knowledge and yield improved performance on target domain data. We extensively evaluated our method on the MM-WHS 2017 challenge dataset. The experiments demonstrate the superiority of our framework over other semi-supervised learning and domain adaptation methods. Moreover, the performance gains are obtained in both directions, i.e., adapting from MR to CT and from CT to MR.
Supervised learning under label noise has seen numerous advances recently, while existing theoretical findings and empirical results broadly build on the class-conditional noise (CCN) assumption that the noise is independent of input features given the true label. In this work, we present a theoretical hypothesis test and prove that noise in real-world datasets is unlikely to be CCN, which confirms that label noise should depend on the instance and justifies the urgent need to go beyond the CCN assumption. The theoretical results motivate us to study the more general and practically relevant instance-dependent noise (IDN). To stimulate the development of theory and methodology on IDN, we formalize an algorithm to generate controllable IDN and present both theoretical and empirical evidence to show that IDN is semantically meaningful and challenging. As a primary attempt to combat IDN, we present a tiny algorithm termed self-evolution average label (SEAL), which not only stands out under IDN with various noise fractions, but also improves generalization on the real-world noise benchmark Clothing1M. Our code is released. Notably, our theoretical analysis in Section 2 provides rigorous motivation for studying IDN, which is an important topic that deserves more research attention in the future.
For multi-class classification under class-conditional label noise, we prove that the accuracy metric itself can be robust. We make this finding concrete in two essential aspects, training and validation, with which we address critical issues in learning with noisy labels. For training, we show that maximizing training accuracy on sufficiently many noisy samples yields an approximately optimal classifier. For validation, we prove that a noisy validation set is reliable, addressing the critical demand of model selection in scenarios like hyperparameter tuning and early stopping. Previously, model selection using noisy validation samples had not been theoretically justified. We verify our theoretical results and additional claims with extensive experiments. We characterize models trained with noisy labels, motivated by our theoretical results, and verify the utility of a noisy validation set by showing the impressive performance of a framework termed noisy best teacher and student (NTS). Our code is released.
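To see why accuracy measured on noisy labels can still rank models correctly, consider the special case of symmetric noise, where a label is kept with probability 1 - ρ and otherwise flipped uniformly to one of the other K - 1 classes. The sketch below is an illustration under that assumption only (the paper treats class-conditional noise more generally); the function name and the model accuracies are hypothetical.

```python
def expected_noisy_accuracy(clean_accuracy, noise_rate, num_classes):
    """Expected accuracy measured against symmetrically corrupted labels.

    Assumes each label is kept with probability 1 - noise_rate and otherwise
    flipped uniformly to one of the other num_classes - 1 classes.
    """
    flip_to_each = noise_rate / (num_classes - 1)
    # A correct prediction matches the noisy label only if the label was kept;
    # a wrong prediction matches only if the label flipped to the predicted class.
    return (1 - noise_rate) * clean_accuracy + flip_to_each * (1 - clean_accuracy)


# Hypothetical clean accuracies of three candidate models.
clean = {"model_a": 0.70, "model_b": 0.85, "model_c": 0.90}
noisy = {
    name: expected_noisy_accuracy(acc, noise_rate=0.4, num_classes=10)
    for name, acc in clean.items()
}

best_by_clean = max(clean, key=clean.get)
best_by_noisy = max(noisy, key=noisy.get)
```

Because the expected noisy accuracy is an increasing linear function of the clean accuracy whenever ρ < (K - 1)/K, the model that is best on clean labels is also best in expectation on the noisy validation set.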
Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery. With recent advancement in robot-assisted minimally invasive surgery, rich information including surgical videos and robotic kinematics can be recorded, which provide complementary knowledge for understanding surgical gestures. However, existing methods either solely adopt uni-modal data or directly concatenate multi-modal representations, which cannot sufficiently exploit the informative correlations inherent in visual and kinematics data to boost gesture recognition accuracy. In this regard, we propose a novel multimodal relational graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information through interactive message propagation in the latent feature space. Specifically, we first extract embeddings from video and kinematics sequences with temporal convolutional networks and LSTM units. Next, we identify multi-relations in these multi-modal features and model them through a hierarchical relational graph learning module. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on both suturing and knot tying tasks. Furthermore, we validated our method on in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK) platforms in two centers, with consistent promising performance achieved.
Deep convolutional neural networks have significantly boosted the performance of fundus image segmentation when test datasets have the same distribution as the training datasets. However, in clinical practice, medical images often exhibit variations in appearance for various reasons, e.g., different scanner vendors and image quality. These distribution discrepancies could lead deep networks to over-fit on the training datasets and lack generalization ability on unseen test datasets. To alleviate this issue, we present a novel Domain-oriented Feature Embedding (DoFE) framework to improve the generalization ability of CNNs on unseen target domains by exploring the knowledge from multiple source domains. Our DoFE framework dynamically enriches the image features with additional domain prior knowledge learned from multi-source domains to make the semantic features more discriminative. Specifically, we introduce a Domain Knowledge Pool to learn and memorize the prior information extracted from multi-source domains. Then the original image features are augmented with domain-oriented aggregated features, which are induced from the knowledge pool based on the similarity between the input image and multi-source domain images. We further design a novel domain code prediction branch to infer this similarity and employ an attention-guided mechanism to dynamically combine the aggregated features with the semantic features. We comprehensively evaluate our DoFE framework on two fundus image segmentation tasks, namely optic cup and disc segmentation and vessel segmentation. Our DoFE framework generates satisfactory segmentation results on unseen datasets and surpasses other domain generalization and network regularization methods.
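The idea of inducing aggregated features from a knowledge pool according to a predicted domain similarity can be sketched as a weighted sum. This is only an illustrative reading of the abstract: the function names, the vector-shaped pool entries, the scalar `attention` gate, and the additive combination are all assumptions, not the DoFE implementation.

```python
def aggregate_domain_features(knowledge_pool, domain_code):
    """Weighted sum of per-domain prior features.

    knowledge_pool: list of K feature vectors, one per source domain.
    domain_code: list of K non-negative similarity weights summing to 1.
    """
    dim = len(knowledge_pool[0])
    return [
        sum(w * prior[d] for w, prior in zip(domain_code, knowledge_pool))
        for d in range(dim)
    ]


def augment(image_features, aggregated, attention):
    """Blend aggregated domain features into the image features.

    `attention` in [0, 1] is a hypothetical scalar gate; the abstract only
    states that the combination is attention-guided.
    """
    return [f + attention * a for f, a in zip(image_features, aggregated)]


pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source domains (toy values)
code = [0.5, 0.25, 0.25]                     # predicted similarity to each domain
aggregated = aggregate_domain_features(pool, code)
enriched = augment([2.0, 2.0], aggregated, attention=0.5)
```

An input that resembles the first source domain most strongly therefore receives an aggregated prior dominated by that domain's pool entry.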
The success of deep convolutional neural networks is partially attributed to the massive amount of annotated training data. However, in practice, medical data annotations are usually expensive and time-consuming to obtain. Considering that multi-modality data with the same anatomic structures are widely available in clinical routine, in this paper, we aim to exploit the prior knowledge (e.g., shape priors) learned from one modality (a.k.a. the assistant modality) to improve the segmentation performance on another modality (a.k.a. the target modality) to compensate for annotation scarcity. To alleviate the learning difficulties caused by modality-specific appearance discrepancy, we first present an Image Alignment Module (IAM) to narrow the appearance gap between assistant and target modality data. We then propose a novel Mutual Knowledge Distillation (MKD) scheme to thoroughly exploit the modality-shared knowledge to facilitate the target-modality segmentation. To be specific, we formulate our framework as an integration of two individual segmentors. Each segmentor not only explicitly extracts the knowledge of one modality from the corresponding annotations, but also implicitly explores the knowledge of the other modality from its counterpart in a mutually guided manner. The ensemble of the two segmentors further integrates the knowledge from both modalities and generates reliable segmentation results on the target modality. Experimental results on the public multi-class cardiac segmentation data, i.e., MMWHS 2017, show that our method achieves large improvements on CT segmentation by utilizing additional MRI data and outperforms other state-of-the-art multi-modality learning methods.
Deep learning methods show promising results for overlapping cervical cell instance segmentation. However, training a model with good generalization ability requires a large volume of pixel-level annotations, which are expensive and time-consuming to acquire. In this paper, we propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation. We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining (MMT-PSM), which consists of a teacher and a student network during training. The two networks are encouraged to be consistent at both the feature and semantic levels under small perturbations. The teacher's self-ensemble predictions from $K$-time augmented samples are used to construct reliable pseudo-labels for optimizing the student. We design a novel strategy to estimate the sensitivity to perturbations for each proposal and select informative samples from massive cases to facilitate fast and effective semantic distillation. In addition, to eliminate the unavoidable noise from the background region, we propose to use the predicted segmentation mask as guidance to enforce the feature distillation in the foreground region. Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only, and outperforms state-of-the-art semi-supervised methods.
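Two ingredients named above, a mean-teacher style weight update and self-ensembling over $K$ augmented views, can be sketched in a few lines. The sketch assumes the standard mean-teacher convention that teacher weights are an exponential moving average of student weights; the function names, the decay value, and the toy predictions are hypothetical, and the mask guidance and sample mining steps are omitted.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]


def self_ensemble(predictions):
    """Average K probability vectors predicted for K augmented copies of one sample."""
    k = len(predictions)
    return [sum(p[c] for p in predictions) / k for c in range(len(predictions[0]))]


def pseudo_label(predictions):
    """Class index with the highest averaged probability."""
    mean = self_ensemble(predictions)
    return max(range(len(mean)), key=mean.__getitem__)


# One EMA step with an exaggerated decay so the effect is easy to read.
teacher = ema_update(teacher=[1.0, 2.0], student=[3.0, 4.0], decay=0.5)

# Teacher outputs for K = 3 augmented views of one proposal (toy values).
views = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
mean_prediction = self_ensemble(views)
label = pseudo_label(views)
```

Averaging over views smooths out the one view that disagrees, so the pseudo-label follows the majority evidence rather than a single perturbed prediction.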
The generalization capability of neural networks across domains is crucial for real-world applications. We argue that a generalized object recognition system should understand both the relationships among different images and the images themselves. To this end, we present a new domain generalization framework that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains. To be specific, we formulate our framework with feature embedding using a multi-task learning paradigm. Besides conducting the common supervised recognition task, we seamlessly integrate a momentum metric learning task and a self-supervised auxiliary task to collectively utilize the extrinsic supervision and intrinsic supervision. Also, we develop an effective momentum metric learning scheme with K-hard negative mining to help the network capture image relationships for domain generalization. We demonstrate the effectiveness of our approach on two standard object recognition benchmarks, VLCS and PACS, and show that our method achieves state-of-the-art performance.
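K-hard negative mining can be illustrated with a small metric-learning sketch: among all negatives, keep only the K closest to the anchor and penalize them with a margin-based loss. The hinge triplet form, the Euclidean distance, the margin, and the function names are assumptions for illustration; the momentum-updated encoder mentioned in the abstract is not modeled here.

```python
import math


def k_hard_negatives(anchor, negatives, k):
    """Return the k negatives closest to the anchor (the hardest ones)."""
    return sorted(negatives, key=lambda n: math.dist(anchor, n))[:k]


def triplet_loss_k_hard(anchor, positive, negatives, k, margin=1.0):
    """Mean hinge triplet loss over the k hardest negatives."""
    d_pos = math.dist(anchor, positive)
    hard = k_hard_negatives(anchor, negatives, k)
    losses = [max(0.0, d_pos - math.dist(anchor, n) + margin) for n in hard]
    return sum(losses) / len(losses)


anchor = (0.0, 0.0)
positive = (1.0, 0.0)  # embedding of an image from the same class
negatives = [(3.0, 0.0), (0.5, 0.0), (0.0, 2.0), (10.0, 10.0)]

hard = k_hard_negatives(anchor, negatives, k=2)
loss = triplet_loss_k_hard(anchor, positive, negatives, k=2)
```

Restricting the loss to the hardest negatives concentrates the gradient on the pairs that currently violate the margin, while easy negatives such as the far-away point contribute nothing.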
Medical image annotations are prohibitively time-consuming and expensive to obtain. To alleviate annotation scarcity, many approaches have been developed to efficiently utilize extra information, e.g., semi-supervised learning, which further explores plentiful unlabeled data, and domain adaptation, including multi-modality learning and unsupervised domain adaptation, which resorts to prior knowledge from an additional modality. In this paper, we aim to investigate the feasibility of simultaneously leveraging abundant unlabeled data and well-established cross-modality data for annotation-efficient medical image segmentation. To this end, we propose a novel semi-supervised domain adaptation approach, namely Dual-Teacher, where the student model not only learns from labeled target data (e.g., CT), but also explores unlabeled target data and labeled source data (e.g., MR) through two teacher models. Specifically, the student model learns the knowledge of unlabeled target data from the intra-domain teacher by encouraging prediction consistency, and learns the shape priors embedded in labeled source data from the inter-domain teacher via knowledge distillation. Consequently, the student model can effectively exploit the information from all three data resources and comprehensively integrate them to achieve improved performance. We conduct extensive experiments on the MM-WHS 2017 dataset and demonstrate that our approach is able to concurrently utilize unlabeled data and cross-modality data with superior performance, outperforming semi-supervised learning and domain adaptation methods by a large margin.
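The three learning signals described above (supervision on labeled target data, prediction consistency with the intra-domain teacher, and knowledge distillation from the inter-domain teacher) can be combined as a weighted sum. The sketch below is a per-sample toy version under several assumptions: mean squared error for consistency, a temperature-softened KL divergence for distillation, and hypothetical weights `w_con` and `w_kd`. None of these specifics are given in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and intra-domain teacher predictions."""
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def student_objective(supervised, consistency, distillation, w_con=0.1, w_kd=0.1):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    return supervised + w_con * consistency + w_kd * distillation


con = consistency_loss(softmax([2.0, 0.0]), softmax([1.0, 0.0]))
kd = distillation_loss([2.0, 0.0], [1.0, 0.0])
total = student_objective(supervised=0.5, consistency=con, distillation=kd)
```

Both auxiliary terms vanish when the student already agrees with the corresponding teacher, so they only steer the student where its predictions diverge.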