
Yanfeng Wang


Long-Tailed Partial Label Learning via Dynamic Rebalancing

Feb 10, 2023
Feng Hong, Jiangchao Yao, Zhihan Zhou, Ya Zhang, Yanfeng Wang

Real-world data often couples label ambiguity with heavy class imbalance, challenging the robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of the two, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, while the performance of PLL degrades severely in a long-tailed context. We show that even with the aid of an oracle class prior, state-of-the-art methods underperform because the constant rebalancing used in LT is harmful to label disambiguation in PLL. To overcome this challenge, we propose a dynamic rebalancing method, termed RECORDS, that assumes no prior knowledge about the class distribution. Based on a parametric decomposition of the biased output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gains of RECORDS over a range of baselines. The code is publicly available.
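
A minimal sketch of the dynamic-rebalancing idea described above, assuming a standard PyTorch classifier: the class prior is estimated from the model's own predictions with a momentum update and used for logit adjustment. The momentum coefficient and the exact debiasing form are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

class DynamicRebalancer:
    """Keeps a momentum-updated estimate of the (unknown) class prior from the
    model's own predictions and uses it to debias logits during training."""

    def __init__(self, num_classes, momentum=0.9):
        self.momentum = momentum
        self.prior = torch.full((num_classes,), 1.0 / num_classes)

    @torch.no_grad()
    def update(self, logits):
        # Running estimate of the class distribution implied by the current outputs.
        batch_prior = F.softmax(logits, dim=1).mean(dim=0)
        self.prior = self.momentum * self.prior + (1 - self.momentum) * batch_prior.cpu()

    def debias(self, logits):
        # Logit adjustment with the *current* estimate instead of a fixed oracle prior,
        # so the correction co-evolves with label disambiguation.
        return logits - torch.log(self.prior.to(logits.device) + 1e-12)
```

In use, `update` would be called on each batch's logits and `debias` applied before the PLL disambiguation loss, so the rebalancing strength tracks the model's evolving view of the class distribution.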

* ICLR 2023 

Guiding Text-to-Image Diffusion Model Towards Grounded Generation

Jan 12, 2023
Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

The goal of this paper is to augment a pre-trained text-to-image diffusion model with the ability of open-vocabulary object grounding, i.e., simultaneously generating images and segmentation masks for the visual entities described in the text prompt. We make the following contributions: (i) we insert a grounding module into the existing diffusion model that can be trained to align the visual and textual embedding spaces of the diffusion model with only a small number of object categories; (ii) we propose an automatic pipeline for constructing a dataset of {image, segmentation mask, text prompt} triplets to train the proposed grounding module; (iii) we evaluate the performance of open-vocabulary grounding on images generated by the text-to-image diffusion model and show that the module can segment objects of categories well beyond those seen at training time; (iv) we adopt the guided diffusion model to build a synthetic semantic segmentation dataset and show that a standard segmentation model trained on this dataset achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for adopting the powerful diffusion model for discriminative tasks.
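
A rough sketch of what such a grounding module could look like, assuming per-pixel features can be read out from the frozen diffusion model and a per-entity text embedding is available; the module name, projections, and dimensions below are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Illustrative grounding module: projects frozen diffusion features and a text
    embedding into a shared space and predicts a mask by dot-product similarity."""

    def __init__(self, visual_dim, text_dim, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Conv2d(visual_dim, embed_dim, kernel_size=1)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, visual_feats, text_emb):
        # visual_feats: (B, C, H, W) features from the frozen diffusion model
        # text_emb:     (B, D) embedding of one visual entity in the prompt
        v = self.visual_proj(visual_feats)                # (B, E, H, W)
        t = self.text_proj(text_emb)                      # (B, E)
        mask_logits = torch.einsum("behw,be->bhw", v, t)  # per-pixel similarity
        return mask_logits  # supervise with a mask loss on the small labeled set
```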

Integrating features from lymph node stations for metastatic lymph node detection

Jan 09, 2023
Chaoyi Wu, Feng Chang, Xiao Su, Zhihan Wu, Yanfeng Wang, Ling Zhu, Ya Zhang

Metastasis to lymph nodes (LNs), the most common route of spread for primary tumor cells, is a sign of increased mortality. However, metastatic LNs are time-consuming and challenging to detect even for professional radiologists due to their small size, high sparsity, and ambiguous appearance. It is therefore desirable to leverage recent advances in deep learning to detect metastatic LNs automatically. On top of a two-stage detection network, we introduce an additional branch that exploits LN stations, an important reference for radiologists during metastatic LN diagnosis, as supplementary information for metastatic LN detection. The branch solves a closely related task at the LN station level, i.e., classifying whether an LN station contains metastatic LNs, so as to learn representations for LN stations. Considering that a metastatic LN station is expected to significantly affect nearby ones, the branch adopts a GCN-based structure to model the relationships among different LN stations. At the classification stage of metastatic LN detection, the learned LN station features, together with features reflecting the distances between the LN candidate and the LN stations, are integrated with the LN features. We validate our method on a dataset containing 114 intravenous contrast-enhanced Computed Tomography (CT) images of oral squamous cell carcinoma (OSCC) patients and show that it outperforms several state-of-the-art methods on the mFROC, maxF1, and AUC scores.
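
A simplified sketch of the station branch, assuming pre-extracted station features and a normalized station-adjacency matrix; the two-layer GCN and the distance-weighted fusion below are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StationGCN(nn.Module):
    """Illustrative GCN branch over LN stations; the adjacency encodes which
    stations are close enough to influence each other."""

    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, hid_dim)
        self.cls = nn.Linear(hid_dim, 2)  # station contains metastatic LNs or not

    def forward(self, x, adj):
        # x: (S, in_dim) station features, adj: (S, S) normalized adjacency
        h = torch.relu(adj @ self.w1(x))
        h = torch.relu(adj @ self.w2(h))
        return h, self.cls(h)  # station embeddings + auxiliary classification logits

def fuse_candidate(ln_feat, station_emb, dist_feat):
    # Concatenate LN candidate features with a distance-weighted station context.
    weights = torch.softmax(-dist_feat, dim=0)   # closer stations get larger weight
    context = (weights.unsqueeze(1) * station_emb).sum(dim=0)
    return torch.cat([ln_feat, context, dist_feat], dim=0)
```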

* Computerized Medical Imaging and Graphics, Volume 101, 2022, 102108, ISSN 0895-6111  

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training

Jan 05, 2023
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

In this paper, we consider the problem of enhancing self-supervised visual-language pre-training (VLP) with medical-specific knowledge by exploiting the paired image-text reports from daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel report filter to extract medical entities, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; Second, we propose a novel entity embedding module that queries an external knowledge description base to exploit the rich context the medical domain affords and implicitly builds relationships between entities in the language embedding space; Third, we propose a novel Transformer-based fusion model for spatially aligning entity descriptions with visual signals at the image patch level using only self-supervised learning, thus enabling spatial grounding; Fourth, we conduct thorough experiments to validate the effectiveness of our proposed architecture and report results on numerous public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model demonstrates strong performance compared with prior methods on disease classification and grounding.
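
A minimal sketch of the entity-to-patch fusion idea, assuming entity embeddings have already been looked up from a knowledge description base and projected to the same dimension as the visual patch features; the single cross-attention block below is an illustrative stand-in for the paper's full fusion model.

```python
import torch
import torch.nn as nn

class EntityFusion(nn.Module):
    """Illustrative fusion: entity queries (embeddings of medical entity descriptions)
    attend over image patch features, yielding patch-level grounding maps."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, entity_emb, patch_feats):
        # entity_emb:  (B, N_entities, dim) embeddings from the knowledge base
        # patch_feats: (B, N_patches, dim)  visual features per image patch
        fused, attn_map = self.attn(entity_emb, patch_feats, patch_feats)
        fused = fused + self.ffn(fused)
        # attn_map holds per-entity weights over patches -> spatial grounding heatmaps
        return fused, attn_map
```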

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Dec 19, 2022
Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Yanfeng Wang, Qi Tian

Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the mismatch between the classification and localization objectives causes the temporally localized results to suffer from serious incompleteness. To tackle this issue without additional annotations, this paper considers distilling free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP suffer from an over-completeness issue, which is exactly complementary to the CBP results. To fuse this complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP, respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill confident background pseudo-labels from the CBP branch; during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the complementarity of the two branches is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that our method significantly outperforms state-of-the-art methods.
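
A schematic sketch of the alternate B/F training described above, assuming both branches output frame-level foreground probabilities; the confidence threshold and binary-cross-entropy distillation loss are illustrative assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def alternate_step(cbp_branch, vlp_branch, video_feats, optimizer, b_step, tau=0.7):
    """One illustrative B/F step. Each branch maps video features to per-frame
    foreground probabilities of shape (T,)."""
    with torch.no_grad():
        if b_step:
            # B step: the CBP branch provides confident *background* pseudo-labels.
            teacher_scores = cbp_branch(video_feats)
            mask = teacher_scores < (1 - tau)            # confidently background frames
            target = torch.zeros_like(teacher_scores)    # target foreground prob. = 0
            student = vlp_branch
        else:
            # F step: the VLP branch provides confident *foreground* pseudo-labels.
            teacher_scores = vlp_branch(video_feats)
            mask = teacher_scores > tau                  # confidently foreground frames
            target = torch.ones_like(teacher_scores)
            student = cbp_branch
    if not mask.any():
        return None                                      # no confident frames this step
    pred = student(video_feats)
    loss = F.binary_cross_entropy(pred[mask], target[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```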

* The first two authors contributed equally 

FedSkip: Combatting Statistical Heterogeneity with Federated Skip Aggregation

Dec 14, 2022
Ziqing Fan, Yanfeng Wang, Jiangchao Yao, Lingjuan Lyu, Ya Zhang, Qi Tian

The statistical heterogeneity of non-independent and identically distributed (non-IID) data on local clients significantly limits the performance of federated learning. Previous attempts such as FedProx, SCAFFOLD, MOON, FedNova, and FedDyn take an optimization perspective, introducing an auxiliary term or re-weighting local updates to calibrate the learning bias or the objective inconsistency. However, beyond these improvements to federated averaging, our analysis shows that another critical bottleneck is the poorer optima reached by client models under more heterogeneous conditions. We thus introduce a data-driven approach called FedSkip, which improves the client optima by periodically skipping federated averaging and scattering local models across devices. We provide a theoretical analysis of the possible benefit of FedSkip and conduct extensive experiments on a range of datasets to demonstrate that FedSkip achieves much higher accuracy, better aggregation efficiency, and competitive communication efficiency. Source code is available at: https://github.com/MediaBrain-SJTU/FedSkip.
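
A minimal sketch of the skip-and-scatter server logic, assuming floating-point model parameters and a fixed skip period; this illustrates the idea rather than reproducing the released FedSkip implementation.

```python
import copy
import random
import torch

def fedskip_round(global_model, client_models, round_idx, skip_period=4):
    """Illustrative FedSkip-style server step: federated averaging is applied only
    every `skip_period` rounds; in the skipped rounds, local models are scattered
    (exchanged) across clients so each model keeps training on data with a
    different local distribution."""
    if round_idx % skip_period == 0:
        # Standard FedAvg: average parameters and broadcast the global model.
        keys = client_models[0].state_dict().keys()
        avg_state = {k: torch.stack([m.state_dict()[k].float() for m in client_models]).mean(0)
                     for k in keys}
        global_model.load_state_dict(avg_state)
        return [copy.deepcopy(global_model) for _ in client_models]
    # Skip aggregation: permute the local models among clients instead of averaging.
    scattered = client_models[:]
    random.shuffle(scattered)
    return scattered
```

The returned list is the set of models each client trains locally in the next round, so between aggregations every model visits several differently distributed datasets.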

Robust Collaborative 3D Object Detection in Presence of Pose Errors

Nov 15, 2022
Yifan Lu, Quanhao Li, Baoan Liu, Mehrdad Dianati, Chen Feng, Siheng Chen, Yanfeng Wang

Collaborative 3D object detection exploits information exchange among multiple agents to enhance detection accuracy in the presence of sensor impairments such as occlusion. In practice, however, pose estimation errors due to imperfect localization cause spatial message misalignment and significantly reduce the performance of collaboration. To alleviate the adverse impact of pose errors, we propose CoAlign, a novel hybrid collaboration framework that is robust to unknown pose errors. The proposed solution relies on novel agent-object pose graph modeling to enhance pose consistency among collaborating agents. Furthermore, we adopt a multi-scale data fusion strategy to aggregate intermediate features at multiple spatial resolutions. Compared with previous works, which require ground-truth poses for training supervision, CoAlign is more practical since it requires no ground-truth pose supervision during training and makes no specific assumptions about pose errors. Extensive evaluation on multiple datasets confirms that CoAlign significantly reduces relative localization error and achieves state-of-the-art detection performance when pose errors exist. Code is available for the research community at https://github.com/yifanlu0227/CoAlign.
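
A toy 2D illustration of the agent-object pose-graph consistency idea: agent poses are refined so that the same object, observed by different agents in their own frames, maps to a consistent global position. The SE(2) parameterization and off-the-shelf least-squares solver are simplifying assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_agent_poses(init_poses, observations):
    """init_poses: list of (x, y, yaw) noisy agent poses.
    observations: list of (agent_id, object_id, dx, dy), i.e. the object's
    position measured in the agent's own frame."""
    n_agents = len(init_poses)
    n_objects = 1 + max(obs[1] for obs in observations)

    def residuals(params):
        poses = params[: 3 * n_agents].reshape(n_agents, 3)
        objs = params[3 * n_agents:].reshape(n_objects, 2)
        res = []
        for a, o, dx, dy in observations:
            x, y, yaw = poses[a]
            # Project the relative detection into the global frame.
            gx = x + np.cos(yaw) * dx - np.sin(yaw) * dy
            gy = y + np.sin(yaw) * dx + np.cos(yaw) * dy
            # Consistency residual against the shared object position.
            res.extend([gx - objs[o][0], gy - objs[o][1]])
        return res

    x0 = np.concatenate([np.asarray(init_poses, dtype=float).ravel(),
                         np.zeros(2 * n_objects)])
    sol = least_squares(residuals, x0)
    return sol.x[: 3 * n_agents].reshape(n_agents, 3)
```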

Unrolled Graph Learning for Multi-Agent Collaboration

Oct 31, 2022
Enpei Zhang, Shuo Tang, Xiaowen Dong, Siheng Chen, Yanfeng Wang

Multi-agent learning has gained increasing attention for tackling distributed machine learning scenarios under constraints on data exchange. However, existing multi-agent learning models usually consider data fusion under fixed and compulsory collaborative relations among agents, which is not as flexible and autonomous as human collaboration. To fill this gap, we propose a distributed multi-agent learning model inspired by human collaboration, in which agents can autonomously detect suitable collaborators and refer to collaborators' models for better performance. To implement such adaptive collaboration, we use a collaboration graph to indicate the pairwise collaborative relations. The collaboration graph can be obtained by graph learning techniques based on the model similarity between different agents. Since model similarity cannot be formulated by a fixed graph optimization, we design a graph learning network by unrolling, which can learn the underlying similar features among potential collaborators. By testing on both regression and classification tasks, we validate that the proposed collaboration model discovers accurate collaborative relationships and greatly improves agents' learning performance.
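
A simplified sketch of building a collaboration graph from model similarity and using it to mix agents' parameters; for brevity, the learnable unrolled graph-learning step is replaced here by a fixed cosine-similarity softmax, which is an assumption rather than the paper's method.

```python
import torch

def collaboration_graph(agent_params, temperature=1.0):
    """agent_params: list of per-agent parameter lists (same architecture).
    Agents with more similar parameters receive larger edge weights."""
    flat = torch.stack([torch.cat([p.flatten() for p in ps]) for ps in agent_params])
    flat = torch.nn.functional.normalize(flat, dim=1)
    sim = flat @ flat.t()                           # cosine similarity between agents
    sim.fill_diagonal_(float("-inf"))               # no self-edges before softmax
    return torch.softmax(sim / temperature, dim=1)  # row-stochastic collaboration weights

def aggregate(agent_params, graph, alpha=0.5):
    # Each agent mixes its own parameters with a graph-weighted combination of others'.
    new_params = []
    for i, params in enumerate(agent_params):
        mixed = []
        for k, p in enumerate(params):
            neighbor = sum(graph[i, j] * agent_params[j][k] for j in range(len(agent_params)))
            mixed.append(alpha * p + (1 - alpha) * neighbor)
        new_params.append(mixed)
    return new_params
```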

Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

Oct 27, 2022
Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, Weidi Xie

When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual and language understanding tasks. In this paper, we investigate simple, yet effective approaches for adapting pre-trained foundation models to the downstream task of interest, namely, open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, a lightweight, transformer-based fusion module that pairs frozen visual representations with language concepts using only a handful of image segmentation data; as a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment with a broad range of self-supervised models pre-trained under different schemes, e.g., visual-only models (MoCo v3, DINO), language-only models (BERT), and visual-language models (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of Fusioner; on standard benchmarks, e.g., PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondences, we further evaluate on a synthetic dataset, named Mosaic-4, where images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
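
A minimal sketch of the fusion idea, assuming frozen visual patch tokens and frozen class-name embeddings already projected to a common dimension; the tiny transformer encoder and similarity-based mask head below are illustrative, not the released Fusioner architecture.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Illustrative lightweight fusion: frozen visual patch tokens and frozen
    class-name embeddings are jointly processed by a small transformer, then
    masks come from patch-to-class token similarity."""

    def __init__(self, dim=256, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens, class_tokens):
        # patch_tokens: (B, P, dim) from a frozen visual model (e.g. DINO, CLIP)
        # class_tokens: (B, K, dim) from a frozen language model (e.g. BERT, CLIP text)
        fused = self.fusion(torch.cat([patch_tokens, class_tokens], dim=1))
        p = fused[:, : patch_tokens.size(1)]
        c = fused[:, patch_tokens.size(1):]
        logits = torch.einsum("bpd,bkd->bpk", p, c)  # per-patch class scores
        return logits  # reshape to (B, K, H, W) and upsample for segmentation masks
```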

* BMVC 2022 Oral 

Number-Adaptive Prototype Learning for 3D Point Cloud Semantic Segmentation

Oct 18, 2022
Yangheng Zhao, Jun Wang, Xiaolong Li, Yue Hu, Ce Zhang, Yanfeng Wang, Siheng Chen

3D point cloud semantic segmentation is one of the fundamental tasks for 3D scene understanding and has been widely used in metaverse applications. Many recent 3D semantic segmentation methods learn a single prototype (classifier weights) for each semantic class and classify 3D points according to their nearest prototype. However, learning only one prototype per class limits the model's ability to describe the high-variance patterns within a class. Instead, in this paper, we propose to use an adaptive number of prototypes to dynamically describe the different point patterns within a semantic class. Building on the capability of the vision transformer, we design a Number-Adaptive Prototype Learning (NAPL) model for point cloud semantic segmentation. To train the NAPL model, we propose a simple yet effective prototype dropout training strategy, which enables the model to adaptively produce prototypes for each class. Experimental results on the SemanticKITTI dataset demonstrate that our method achieves a 2.3% mIoU improvement over the baseline model based on the point-wise classification paradigm.
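
A rough sketch of a multi-prototype classifier with prototype dropout, the mechanism described above; the fixed prototype pool size and dropout rate are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiPrototypeClassifier(nn.Module):
    """Illustrative number-adaptive prototype classifier: each class owns several
    prototypes, a point takes the score of its best-matching prototype, and random
    prototype dropout during training encourages class patterns to spread over
    a variable number of prototypes."""

    def __init__(self, feat_dim, num_classes, protos_per_class=4, drop_p=0.3):
        super().__init__()
        self.prototypes = nn.Parameter(
            torch.randn(num_classes, protos_per_class, feat_dim))
        self.drop_p = drop_p

    def forward(self, point_feats):
        # point_feats: (N, feat_dim); similarity of every point to every prototype
        sim = torch.einsum("nd,ckd->nck", point_feats, self.prototypes)
        if self.training:
            # Prototype dropout: mask out prototypes at random (keep >= 1 per class).
            keep = torch.rand(sim.shape[1:], device=sim.device) > self.drop_p
            keep[:, 0] = True
            sim = sim.masked_fill(~keep, float("-inf"))
        return sim.max(dim=2).values  # (N, num_classes): best prototype per class
```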
