Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed in private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, accessed each submission and evaluated the team based on open-source practices, failure case analysis, ablation studies, usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal Generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, under-water and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released on https://github.com/htqin/GoogleBard-VisUnderstand
Unlike existing fully-supervised approaches, we rethink colorectal polyp segmentation from an out-of-distribution perspective with a simple but effective self-supervised learning approach. We leverage the ability of masked autoencoders -- self-supervised vision transformers trained on a reconstruction task -- to learn in-distribution representations; here, the distribution of healthy colon images. We then perform out-of-distribution reconstruction and inference, with feature space standardisation to align the latent distribution of the diverse abnormal samples with the statistics of the healthy samples. We generate per-pixel anomaly scores for each image by calculating the difference between the input and reconstructed images and use this signal for out-of-distribution (ie, polyp) segmentation. Experimental results on six benchmarks show that our model has excellent segmentation performance and generalises across datasets. Our code is publicly available at https://github.com/GewelsJI/Polyp-OOD.
Segmenting anything is a ground-breaking step toward artificial general intelligence, and the Segment Anything Model (SAM) greatly fosters the foundation models for computer vision. We could not be more excited to probe the performance traits of SAM. In particular, exploring situations in which SAM does not perform well is interesting. In this report, we choose three concealed scenes, i.e., camouflaged animals, industrial defects, and medical lesions, to evaluate SAM under unprompted settings. Our main observation is that SAM looks unskilled in concealed scenes.
Concealed scene understanding (CSU) is a hot computer vision topic aiming to perceive objects with camouflaged properties. The current boom in its advanced techniques and novel applications makes it timely to provide an up-to-date survey to enable researchers to understand the global picture of the CSU field, including both current achievements and major challenges. This paper makes four contributions: (1) For the first time, we present a comprehensive survey of the deep learning techniques oriented at CSU, including a background with its taxonomy, task-unique challenges, and a review of its developments in the deep learning era via surveying existing datasets and deep techniques. (2) For a quantitative comparison of the state-of-the-art, we contribute the largest and latest benchmark for Concealed Object Segmentation (COS). (3) To evaluate the transferability of deep CSU in practical scenarios, we re-organize the largest concealed defect segmentation dataset termed CDS2K with the hard cases from diversified industrial scenarios, on which we construct a comprehensive benchmark. (4) We discuss open problems and potential research directions for this community. Our code and datasets are available at https://github.com/DengPingFan/CSU, which will be updated continuously to watch and summarize the advancements in this rapidly evolving field.
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize vision transformer architecture for replacing the BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
Recently CNN-based RGB-D salient object detection (SOD) has obtained significant improvement on detection accuracy. However, existing models often fail to perform well in terms of efficiency and accuracy simultaneously. This hinders their potential applications on mobile devices as well as many real-world problems. To bridge the accuracy gap between lightweight and large models for RGB-D SOD, in this paper, an efficient module that can greatly improve the accuracy but adds little computation is proposed. Inspired by the fact that depth quality is a key factor influencing the accuracy, we propose an efficient depth quality-inspired feature manipulation (DQFM) process, which can dynamically filter depth features according to depth quality. The proposed DQFM resorts to the alignment of low-level RGB and depth features, as well as holistic attention of the depth stream to explicitly control and enhance cross-modal fusion. We embed DQFM to obtain an efficient lightweight RGB-D SOD model called DFM-Net, where we in addition design a tailored depth backbone and a two-stage decoder as basic parts. Extensive experimental results on nine RGB-D datasets demonstrate that our DFM-Net outperforms recent efficient models, running at about 20 FPS on CPU with only 8.5Mb model size, and meanwhile being 2.9/2.4 times faster and 6.7/3.1 times smaller than the latest best models A2dele and MobileSal. It also maintains state-of-the-art accuracy when even compared to non-efficient models. Interestingly, further statistics and analyses verify the ability of DQFM in distinguishing depth maps of various qualities without any quality labels. Last but not least, we further apply DFM-Net to deal with video SOD (VSOD), achieving comparable performance against recent efficient models while being 3/2.3 times faster/smaller than the prior best in this field. Our code is available at https://github.com/zwbx/DFM-Net.
Camouflaged object detection (COD) aims to identify the objects that conceal themselves in natural scenes. Accurate COD suffers from a number of challenges associated with low boundary contrast and the large variation of object appearances, e.g., object size and shape. To address these challenges, we propose a novel Context-aware Cross-level Fusion Network (C2F-Net), which fuses context-aware cross-level features for accurately identifying camouflaged objects. Specifically, we compute informative attention coefficients from multi-level features with our Attention-induced Cross-level Fusion Module (ACFM), which further integrates the features under the guidance of attention coefficients. We then propose a Dual-branch Global Context Module (DGCM) to refine the fused features for informative feature representations by exploiting rich global context information. Multiple ACFMs and DGCMs are integrated in a cascaded manner for generating a coarse prediction from high-level features. The coarse prediction acts as an attention map to refine the low-level features before passing them to our Camouflage Inference Module (CIM) to generate the final prediction. We perform extensive experiments on three widely used benchmark datasets and compare C2F-Net with state-of-the-art (SOTA) models. The results show that C2F-Net is an effective COD model and outperforms SOTA models remarkably. Further, an evaluation on polyp segmentation datasets demonstrates the promising potentials of our C2F-Net in COD downstream applications. Our code is publicly available at: https://github.com/Ben57882/C2FNet-TSCVT.