Meng Wei

OV-PARTS: Towards Open-Vocabulary Part Segmentation

Oct 08, 2023
Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang

Segmenting and recognizing diverse object parts is a crucial ability in applications spanning various computer vision and robotic tasks. While significant progress has been made in object-level Open-Vocabulary Semantic Segmentation (OVSS), i.e., segmenting objects with arbitrary text, the corresponding part-level research poses additional challenges. First, part segmentation inherently involves intricate boundaries, and limited annotated data compounds the challenge. Second, part segmentation introduces an open granularity challenge due to the diverse and often ambiguous definitions of parts in the open world. Furthermore, the large-scale vision-language models that play a key role in the open-vocabulary setting struggle to recognize parts as effectively as objects. To comprehensively investigate and tackle these challenges, we propose an Open-Vocabulary Part Segmentation (OV-PARTS) benchmark. OV-PARTS includes refined versions of two publicly available datasets, Pascal-Part-116 and ADE20K-Part-234, and covers three specific tasks: Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation, providing insights into the analogical reasoning, open-granularity, and few-shot adaptation abilities of models. Moreover, we analyze and adapt two prevailing paradigms of existing object-level OVSS methods for OV-PARTS. Extensive experimental analysis is conducted to inspire future research on leveraging foundation models for OV-PARTS. The code and dataset are available at https://github.com/OpenRobotLab/OV_PARTS.

* Accepted by the NeurIPS 2023 Datasets and Benchmarks Track

Understanding Masked Autoencoders From a Local Contrastive Perspective

Oct 03, 2023
Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang

The Masked Autoencoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the mechanisms underlying MAE's efficacy are less well explored than those of the canonical contrastive learning paradigm. In this paper, we explore a new perspective to explain what truly contributes to the "rich hidden representations inside the MAE". First, concerning MAE's generative pretraining pathway, in which a unique encoder-decoder architecture reconstructs images from aggressively masked inputs, we conduct an in-depth analysis of the decoder's behavior. We empirically find that MAE's decoder mainly learns local features with a limited receptive field, adhering to the well-known locality principle. Building upon this locality assumption, we propose a theoretical framework that reformulates reconstruction-based MAE into a local, region-level contrastive learning form for improved understanding. Furthermore, to substantiate the local contrastive nature of MAE, we introduce a Siamese architecture that combines the essence of MAE and contrastive learning without masking or an explicit decoder, shedding light on a unified and more flexible self-supervised learning framework.
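
As a concrete illustration of the region-level contrastive view argued for above, the sketch below (an assumption based on our reading of the abstract, not the authors' code) treats patch embeddings from two branches as positives when they come from the same spatial position and as negatives otherwise.

```python
# Minimal sketch of a local, region-level contrastive loss over patch embeddings.
import torch
import torch.nn.functional as F

def local_contrastive_loss(patches_a, patches_b, temperature=0.1):
    """patches_a, patches_b: (B, N, D) patch embeddings from the two branches."""
    B, N, D = patches_a.shape
    a = F.normalize(patches_a.reshape(B * N, D), dim=-1)
    b = F.normalize(patches_b.reshape(B * N, D), dim=-1)
    logits = a @ b.t() / temperature                     # (B*N, B*N) similarity matrix
    targets = torch.arange(B * N, device=a.device)       # same-position patches are the positives
    return F.cross_entropy(logits, targets)

# Usage with dummy embeddings, e.g., from a ViT encoder applied to two views.
loss = local_contrastive_loss(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```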

SegMatch: A semi-supervised learning method for surgical instrument segmentation

Aug 09, 2023
Meng Wei, Charlie Budd, Luis C. Garcia-Peraza-Herrera, Reuben Dorent, Miaojing Shi, Tom Vercauteren

Surgical instrument segmentation is recognised as a key enabler of advanced surgical assistance and improved computer-assisted interventions. In this work, we propose SegMatch, a semi-supervised learning method that reduces the need for expensive annotation of laparoscopic and robotic surgical images. SegMatch builds on FixMatch, a widespread semi-supervised classification pipeline combining consistency regularization and pseudo-labelling, and adapts it for segmentation. In SegMatch, an unlabelled image is weakly augmented and fed into the segmentation model to generate a pseudo-label; the unsupervised loss is then enforced between this pseudo-label and the model's output for an adversarially augmented version of the image, restricted to pixels with high confidence scores. Our adaptation to segmentation tasks includes carefully considering the equivariance and invariance properties of the augmentation functions we rely on. To increase the relevance of our augmentations, we depart from using only handcrafted augmentations and introduce a trainable adversarial augmentation strategy. Our algorithm was evaluated on the MICCAI instrument segmentation challenge datasets Robust-MIS 2019 and EndoVis 2017. Our results demonstrate that adding unlabelled data for training allows us to surpass the performance of fully supervised approaches, which are limited by the availability of training data in these challenges. SegMatch also outperforms a range of state-of-the-art semi-supervised semantic segmentation models at different labelled-to-unlabelled data ratios.

* preprint under review, 12 pages, 7 figures 
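
The unsupervised term described above follows the FixMatch recipe adapted to dense prediction. A minimal sketch, with a placeholder model and illustrative threshold rather than the released implementation:

```python
# Pseudo-labels from the weakly augmented view supervise the prediction on the
# adversarially/strongly augmented view, only on pixels above a confidence threshold.
# Assumes the augmentations preserve pixel geometry, or that pseudo-labels are
# warped accordingly (cf. the equivariance/invariance discussion above).
import torch
import torch.nn.functional as F

def unsupervised_loss(model, weak_img, strong_img, conf_threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(weak_img), dim=1)    # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)              # per-pixel confidence and pseudo-label
    logits_strong = model(strong_img)
    pixel_loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    mask = (conf >= conf_threshold).float()          # keep only confident pixels
    return (pixel_loss * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage; a real setup would plug in the segmentation network and augmentations.
toy_model = torch.nn.Conv2d(3, 4, kernel_size=1)
loss = unsupervised_loss(toy_model, torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```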

Pick the Best Pre-trained Model: Towards Transferability Estimation for Medical Image Segmentation

Jul 22, 2023
Yuncheng Yang, Meng Wei, Junjun He, Jie Yang, Jin Ye, Yun Gu

Transfer learning is a critical technique for training deep neural networks on the challenging medical image segmentation task, which requires enormous resources. With the abundance of medical image data, many research institutions release models trained on various datasets, forming a huge pool of candidate source models to choose from. It is therefore vital to estimate the source models' transferability (i.e., their ability to generalize across different downstream tasks) for proper and efficient model reuse. Since existing Transferability Estimation (TE) algorithms fall short when applied to medical image segmentation, in this paper we propose a new TE method. We first analyze the drawbacks of using existing TE algorithms for medical image segmentation and then design a source-free TE framework that considers both class consistency and feature variety for better estimation. Extensive experiments show that our method surpasses all current algorithms for transferability estimation in medical image segmentation. Code is available at https://github.com/EndoluminalSurgicalVision-IMR/CCFV

* MICCAI 2023 (Early Accepted)
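
As a rough illustration of the two ingredients named above, class consistency and feature variety, the sketch below scores a candidate source model from the features it extracts on target data; the exact formulation is ours, not the CCFV implementation.

```python
# Reward intra-class consistency (compact classes) and overall feature variety (spread).
import torch

def transferability_score(features, labels):
    """features: (N, D) target-data features from a candidate source model; labels: (N,)."""
    consistency = 0.0
    for c in labels.unique():
        class_feats = features[labels == c]
        consistency -= class_feats.var(dim=0, unbiased=False).mean()  # compact classes raise the score
    variety = features.var(dim=0, unbiased=False).mean()              # diverse features raise the score
    return consistency + variety

score = transferability_score(torch.randn(100, 32), torch.randint(0, 5, (100,)))
```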

In Defense of Clip-based Video Relation Detection

Jul 18, 2023
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
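
To make the clip-based paradigm concrete, the sketch below (an illustrative simplification, not the HCM model) merges relation triplets predicted on short clips into video-level relations by linking temporally adjacent clips that share the same (subject, predicate, object).

```python
# Greedy temporal merging of clip-level relation predictions into video-level relations.
from collections import defaultdict

def merge_clip_relations(clip_relations):
    """clip_relations: list of (clip_start, clip_end, subject_id, predicate, object_id)."""
    by_triplet = defaultdict(list)
    for start, end, s, p, o in clip_relations:
        by_triplet[(s, p, o)].append((start, end))
    video_relations = []
    for triplet, spans in by_triplet.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:                      # overlapping/adjacent clips: extend the span
                cur_end = max(cur_end, end)
            else:                                     # temporal gap: close the relation, start a new one
                video_relations.append((cur_start, cur_end, *triplet))
                cur_start, cur_end = start, end
        video_relations.append((cur_start, cur_end, *triplet))
    return video_relations

print(merge_clip_relations([(0, 30, 1, "ride", 2), (30, 60, 1, "ride", 2), (90, 120, 1, "ride", 2)]))
```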

Text Promptable Surgical Instrument Segmentation with Vision-Language Models

Jun 15, 2023
Zijian Zhou, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Miaojing Shi

In this paper, we propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments in minimally invasive surgeries. We redefine the task as text promptable, thereby enabling a more nuanced comprehension of surgical instruments and adaptability to new instrument types. Inspired by recent advancements in vision-language models, we leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder consisting of attention- and convolution-based prompting schemes for surgical instrument segmentation prediction. Our model leverages multiple text prompts for each surgical instrument through a new mixture of prompts mechanism, resulting in enhanced segmentation performance. Additionally, we introduce a hard instrument area reinforcement module to improve image feature comprehension and segmentation precision. Extensive experiments on EndoVis2017 and EndoVis2018 datasets demonstrate our model's superior performance and promising generalization capability. To our knowledge, this is the first implementation of a promptable approach to surgical instrument segmentation, offering significant potential for practical application in the field of robotic-assisted surgery.
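
A minimal sketch of the text-promptable idea described above, with hypothetical shapes and a simple dot-product decoder standing in for the paper's attention- and convolution-based prompting schemes: multiple text prompts per instrument class are mixed into one embedding and correlated with pixel features.

```python
# Mixture-of-prompts class embeddings correlated with pixel features to form mask logits.
import torch
import torch.nn.functional as F

def promptable_segmentation(pixel_feats, prompt_embeds, mixture_weights):
    """
    pixel_feats:     (B, D, H, W) image-encoder features.
    prompt_embeds:   (C, P, D) P text-prompt embeddings per instrument class.
    mixture_weights: (C, P) learned weights of the prompt mixture.
    """
    w = F.softmax(mixture_weights, dim=-1).unsqueeze(-1)                 # (C, P, 1)
    class_embeds = F.normalize((prompt_embeds * w).sum(dim=1), dim=-1)   # (C, D)
    feats = F.normalize(pixel_feats, dim=1)
    return torch.einsum("bdhw,cd->bchw", feats, class_embeds)            # (B, C, H, W) logits

logits = promptable_segmentation(torch.randn(2, 64, 32, 32),
                                 torch.randn(7, 4, 64),
                                 torch.randn(7, 4))
```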

Learning from Stochastic Labels

Feb 01, 2023
Meng Wei, Zhongnian Li, Yong Zhou, Qiaoyu Guo, Xinzheng Xu

Annotating multi-class instances is a crucial task in the field of machine learning. Unfortunately, identifying the correct class label from a long sequence of candidate labels is time-consuming and laborious. To alleviate this problem, we design a novel labeling mechanism called the stochastic label. In this setting, a stochastic label covers two cases: 1) the annotator identifies the correct class label from a small number of randomly given labels; 2) the annotator marks the instance with a None label when the given labels do not contain the correct class label. In this paper, we propose a novel approach to learning from these stochastic labels. We derive an unbiased estimator that utilizes the weaker supervision in stochastic labels to train a multi-class classifier. Additionally, we provide theoretical justification by deriving the estimation error bound of the proposed method. Finally, we conduct extensive experiments on widely used benchmark datasets to validate the superiority of our method over existing state-of-the-art methods.
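
The labeling mechanism itself can be made concrete in a few lines; the sketch below (illustrative only, not the paper's estimator) shows how a stochastic label is generated from a small random subset of candidate classes.

```python
# An annotator sees a small random subset of class labels and either picks the
# true class or answers None when the true class is not among them.
import random

def stochastic_label(true_class, num_classes, subset_size=3, rng=random):
    shown = rng.sample(range(num_classes), subset_size)   # randomly given candidate labels
    return shown, (true_class if true_class in shown else None)

rng = random.Random(0)
print([stochastic_label(2, 10, rng=rng) for _ in range(3)])
```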

Exploring Vanilla U-Net for Lesion Segmentation from Whole-body FDG-PET/CT Scans

Oct 14, 2022
Jin Ye, Haoyu Wang, Ziyan Huang, Zhongying Deng, Yanzhou Su, Can Tu, Qian Wu, Yuncheng Yang, Meng Wei, Jingqi Niu, Junjun He

Tumor lesion segmentation is one of the most important tasks in medical image analysis. In clinical practice, Fluorodeoxyglucose Positron Emission Tomography (FDG-PET) is a widely used technique to identify and quantify metabolically active tumors. However, since FDG-PET scans provide only metabolic information, healthy tissue or benign disease with irregular glucose consumption may be mistaken for cancer. To handle this challenge, PET is commonly combined with Computed Tomography (CT), with the CT used to obtain the anatomic structure of the patient; combining PET-based metabolic and CT-based anatomic information can contribute to better tumor segmentation results. In this paper, we explore the potential of U-Net for lesion segmentation in whole-body FDG-PET/CT scans from three aspects: network architecture, data preprocessing, and data augmentation. The experimental results demonstrate that a vanilla U-Net with a proper input shape can achieve satisfactory performance. Specifically, our method achieved first place on both the preliminary and final leaderboards of the autoPET 2022 challenge. Our code is available at https://github.com/Yejin0111/autoPET2022_Blackbean.

* autoPET 2022, MICCAI 2022 challenge, champion 
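
A minimal sketch of the multi-modal input construction implied above, under the assumption that PET and CT are already resampled to a common grid; the normalization choices are illustrative, not the challenge-winning configuration.

```python
# Normalize each modality and stack PET and CT as channels of a single (2-channel) U-Net input.
import numpy as np

def build_petct_input(pet, ct, ct_window=(-1000.0, 1000.0)):
    """pet, ct: aligned 3D volumes of identical shape (D, H, W)."""
    pet_n = (pet - pet.mean()) / (pet.std() + 1e-6)               # z-score the PET volume
    ct_n = np.clip(ct, *ct_window)
    ct_n = (ct_n - ct_window[0]) / (ct_window[1] - ct_window[0])  # scale windowed HU to [0, 1]
    return np.stack([pet_n, ct_n], axis=0)                        # (2, D, H, W) network input

x = build_petct_input(np.random.rand(32, 64, 64), np.random.uniform(-1000, 1000, (32, 64, 64)))
print(x.shape)  # (2, 32, 64, 64)
```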

Class-Imbalanced Complementary-Label Learning via Weighted Loss

Sep 28, 2022
Meng Wei, Yong Zhou, Zhongnian Li, Xinzheng Xu

Complementary-label learning (CLL) is a common approach in weakly supervised scenarios. However, in real-world datasets, CLL encounters class-imbalanced training samples, where the number of samples in one class is significantly lower than in other classes. Unfortunately, existing CLL approaches have yet to explore the problem of class-imbalanced samples, which reduces prediction accuracy, especially for the under-represented classes. In this paper, we propose a novel problem setting that allows learning from class-imbalanced complementarily labeled samples for multi-class classification. To deal with this novel problem, we propose a new CLL approach called Weighted Complementary-Label Learning (WCLL). The proposed method models a weighted empirical risk minimization loss that exploits the class-imbalanced complementarily labeled information and is also applicable to multi-class imbalanced training samples. Furthermore, we derive the estimation error bound of the proposed method to provide a theoretical guarantee. Finally, we conduct extensive experiments on widely used benchmark datasets to validate the superiority of our method over existing state-of-the-art methods.

* 9 pages, 9 figures, 3 tables 
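
A hedged sketch of a class-weighted complementary-label loss in the spirit of the weighted empirical risk described above; the specific weighting and the log(1 - p) surrogate are illustrative choices, not the exact WCLL risk estimator.

```python
# Each sample only says "the label is NOT class c_bar"; per-class weights
# (e.g., derived from complementary-label frequencies) rebalance rare classes.
import torch
import torch.nn.functional as F

def weighted_complementary_loss(logits, comp_labels, class_weights):
    """logits: (B, C); comp_labels: (B,) classes the samples do NOT belong to; class_weights: (C,)."""
    probs = F.softmax(logits, dim=1)
    p_comp = probs.gather(1, comp_labels.unsqueeze(1)).squeeze(1)   # probability of the complementary class
    per_sample = -torch.log(1.0 - p_comp + 1e-8)                    # push prob of c_bar toward zero
    return (class_weights[comp_labels] * per_sample).mean()

loss = weighted_complementary_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)), torch.ones(5))
```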

Counting with Adaptive Auxiliary Learning

Mar 08, 2022
Yanda Meng, Joshua Bridge, Meng Wei, Yitian Zhao, Yihong Qiao, Xiaoyun Yang, Xiaowei Huang, Yalin Zheng

This paper proposes an adaptive auxiliary task learning approach for object counting problems. Unlike existing auxiliary-task-learning-based methods, we develop an attention-enhanced, adaptively shared backbone network that enables both task-shared and task-tailored feature learning in an end-to-end manner. The network seamlessly combines a standard Convolutional Neural Network (CNN) and a Graph Convolutional Network (GCN) for feature extraction and feature reasoning among different domains of tasks. Our approach gains enriched contextual information by iteratively and hierarchically fusing the features across different task branches of the adaptive CNN backbone. The whole framework pays special attention to the objects' spatial locations and varied density levels, informed by the object (or crowd) segmentation and density-level segmentation auxiliary tasks. In particular, thanks to the proposed dilated contrastive density loss function, our network benefits from individual and regional context supervision, in terms of pixel-independent and pixel-dependent feature learning mechanisms, along with strengthened robustness. Experiments on seven challenging multi-domain datasets demonstrate that our method achieves superior performance to state-of-the-art auxiliary-task-learning-based counting methods. Our code is made publicly available at: https://github.com/smallmax00/Counting_With_Adaptive_Auxiliary
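
A simplified sketch of the multi-task objective implied above, combining the main density-regression counting loss with the two auxiliary segmentation losses; the loss forms and weights are assumptions, not the released code (which additionally relies on the dilated contrastive density loss).

```python
# Main counting (density regression) loss plus object-segmentation and
# density-level-segmentation auxiliary losses, with illustrative weights.
import torch
import torch.nn.functional as F

def multi_task_loss(pred_density, gt_density,
                    pred_seg, gt_seg,
                    pred_level, gt_level,
                    w_seg=0.1, w_level=0.1):
    count_loss = F.mse_loss(pred_density, gt_density)     # main counting objective
    seg_loss = F.cross_entropy(pred_seg, gt_seg)          # object/crowd segmentation auxiliary
    level_loss = F.cross_entropy(pred_level, gt_level)    # density-level segmentation auxiliary
    return count_loss + w_seg * seg_loss + w_level * level_loss

loss = multi_task_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64),
                       torch.randn(2, 2, 64, 64), torch.randint(0, 2, (2, 64, 64)),
                       torch.randn(2, 3, 64, 64), torch.randint(0, 3, (2, 64, 64)))
```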
