Doanh C. Bui

MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Nov 17, 2025

Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen, Yasuhiko Nakashima

Abstract: Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (Class-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.

* WACV2026 Accepted
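
The two-stage TCP inference described in the MergeSlide abstract (pick the task first, then the class within that task) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional embeddings, the task and class labels, and the use of cosine similarity as the matching score are all assumptions made for illustration. In practice the embeddings would come from a vision-language pathology foundation model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tcp_predict(slide_emb, task_prompts, class_prompts):
    """Hypothetical two-stage inference: task first, then class.

    task_prompts:  {task_name: embedding of a task-level prompt}
    class_prompts: {task_name: {class_name: embedding of a class-aware prompt}}
    """
    # Stage 1: choose the task whose task-level prompt best matches the slide.
    task = max(task_prompts, key=lambda t: cosine(slide_emb, task_prompts[t]))
    # Stage 2: score only the class-aware prompts that belong to that task.
    candidates = class_prompts[task]
    label = max(candidates, key=lambda c: cosine(slide_emb, candidates[c]))
    return task, label

# Toy embeddings (assumed values, not real model outputs).
task_prompts = {"lung": [1.0, 0.0, 0.0], "breast": [0.0, 1.0, 0.0]}
class_prompts = {
    "lung": {"LUAD": [0.9, 0.1, 0.3], "LUSC": [0.9, 0.1, -0.3]},
    "breast": {"IDC": [0.1, 0.9, 0.3], "ILC": [0.1, 0.9, -0.3]},
}
slide_emb = [0.8, 0.2, 0.4]

prediction = tcp_predict(slide_emb, task_prompts, class_prompts)
print(prediction)
```

Restricting the second stage to one task's prompts keeps classes from unrelated tasks from competing with each other, which is the motivation the abstract gives for identifying the task first when task identity is unknown.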

Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

May 04, 2025

Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Khang Nguyen, Yasuhiko Nakashima

Abstract: Whole Slide Images (WSIs) play a crucial role in accurate cancer diagnosis and prognosis, as they provide tissue details at the cellular level. However, the rapid growth of computational tasks involving WSIs poses significant challenges. Because WSIs are gigapixel-sized, they are difficult to store, process, and use for model training. Therefore, it is essential to develop lifelong learning approaches for WSI analysis. In scenarios where slides are distributed across multiple institutes, we aim to leverage them to develop a unified online model as a computational tool for cancer diagnosis in clinical and hospital settings. In this study, we introduce ADaFGrad, a method designed to enhance lifelong learning for WSI analysis. First, we leverage pathology vision-language foundation models to develop a framework that enables interaction between a slide's regional tissue features and a predefined text-based prototype buffer. Additionally, we propose a gradient-distillation mechanism that mimics the gradient of a logit with respect to the classification-head parameters across past and current iterations in a continual-learning setting. We construct a sequence of six TCGA datasets for training and evaluation. Experimental results show that ADaFGrad outperforms both state-of-the-art WSI-specific and conventional continual-learning methods after only a few training epochs, exceeding them by up to +5.068% in the class-incremental learning scenario while exhibiting the least forgetting (i.e., retaining the most knowledge from previous tasks). Moreover, ADaFGrad surpasses its baseline by as much as +40.084% in accuracy, further demonstrating the effectiveness of the proposed modules.

ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

Apr 22, 2025

Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Yasuhiko Nakashima

Abstract: Lifelong learning for whole slide images (WSIs) poses the challenge of training a unified model to perform multiple WSI-related tasks, such as cancer subtyping and tumor classification, in a distributed, continual fashion. This is a practical problem in clinics and hospitals, as WSIs are large and require considerable storage, processing, and transfer time. Training new models whenever new tasks are defined is time-consuming. Recent work has applied regularization- and rehearsal-based methods to this setting. However, the rise of vision-language foundation models that align diagnostic text with pathology images raises the question: are these models alone sufficient for lifelong WSI learning using zero-shot classification, or is further investigation into continual learning strategies needed to improve performance? To our knowledge, this is the first study to compare conventional continual-learning approaches with vision-language zero-shot classification for WSIs. Our source code and experimental results will be available soon.

* 10 pages, 3 figures, 1 table, conference submission

HiGDA: Hierarchical Graph of Nodes to Learn Local-to-Global Topology for Semi-Supervised Domain Adaptation

Dec 16, 2024

Ba Hung Ngo, Doanh C. Bui, Nhat-Tuong Do-Tran, Tae Jong Choi

Abstract: The enhanced representational power and broad applicability of deep learning models have attracted significant interest from the research community in recent years. However, these models often struggle to perform effectively under domain shift conditions, where the training data (the source domain) is related to but exhibits different distributions from the testing data (the target domain). To address this challenge, previous studies have attempted to reduce the domain gap between source and target data by incorporating a few labeled target samples during training, a technique known as semi-supervised domain adaptation (SSDA). While this strategy has demonstrated notable improvements in classification performance, the network architectures used in these approaches primarily focus on exploiting the features of individual images, leaving room for improvement in capturing rich representations. In this study, we introduce a Hierarchical Graph of Nodes designed to simultaneously present representations at both feature and category levels. At the feature level, we introduce a local graph to identify the most relevant patches within an image, facilitating adaptability to defined main object representations. At the category level, we employ a global graph to aggregate the features from samples within the same category, thereby enriching overall representations. Extensive experiments on widely used SSDA benchmark datasets, including Office-Home, DomainNet, and VisDA2017, demonstrate that both quantitative and qualitative results substantiate the effectiveness of HiGDA, establishing it as a new state-of-the-art method.

* Accepted for presentation at AAAI2025

MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network

Oct 06, 2024

Doanh C. Bui, Jin Tae Kwak

Abstract: Whole slide image (WSI) classification is a crucial problem for cancer diagnostics in clinics and hospitals. A WSI, acquired at gigapixel size, is commonly tiled into patches and processed by multiple-instance learning (MIL) models. Previous MIL-based models designed for this problem have only been evaluated on individual tasks for specific organs, and the ability to handle multiple tasks within a single model has not been investigated. In this study, we propose MECFormer, a generative Transformer-based model designed to handle multiple tasks within one model. To leverage the power of learning multiple tasks simultaneously and to enhance the model's effectiveness in focusing on each individual task, we introduce an Expert Consultation Network, a projection layer placed at the beginning of the Transformer-based model. Additionally, to enable flexible classification, autoregressive decoding is incorporated by a language decoder for WSI classification. Through extensive experiments on five datasets involving four different organs, one cancer classification task, and four cancer subtyping tasks, MECFormer demonstrates superior performance compared to individual state-of-the-art multiple-instance learning models.

* Accepted for presentation at ACCV2024

QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View

Jul 18, 2024

Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak

Abstract: In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image and then incorporates momentum- and attention-based knowledge distillation to improve the performance of the two tasks. For training, we present an action dictionary-guided design, which consistently yields the most favorable results across our experiments. In the realm of VQA, we leverage object-level features and deploy co-attention networks to train both object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve 2nd rank in the action recognition and anticipation tasks and 1st rank in the VQA task.

* MICCAI-Thompson Challenge 2023

FALFormer: Feature-aware Landmarks self-attention for Whole-slide Image Classification

Jul 11, 2024

Doanh C. Bui, Trinh Thi Le Vuong, Jin Tae Kwak

Abstract: Slide-level classification for whole-slide images (WSIs) has been widely recognized as a crucial problem in digital and computational pathology. Because of the large number of patches, current approaches commonly treat a WSI as a bag of cropped patches and process them via multiple instance learning, which cannot fully explore the relationships among patches; in other words, global information cannot be fully incorporated into decision making. Herein, we propose an efficient and effective slide-level classification model, named FALFormer, that can process a WSI as a whole so as to fully exploit the relationships among all patches and to improve the classification performance. FALFormer is built upon Transformers and the self-attention mechanism. To lessen the computational burden of the original self-attention mechanism and to process all patches in a WSI together, FALFormer employs Nyström self-attention, which approximates the computation by using a smaller number of tokens, or landmarks. For effective learning, FALFormer introduces feature-aware landmarks to enhance the representation power of the landmarks and the quality of the approximation. We systematically evaluate the performance of FALFormer using two public datasets, CAMELYON16 and TCGA-BRCA. The experimental results demonstrate that FALFormer achieves superior performance on both datasets, outperforming the state-of-the-art methods for slide-level classification. This suggests that FALFormer can facilitate an accurate and precise analysis of WSIs, potentially leading to improved diagnosis and prognosis on WSIs.

* 10 pages, 2 figures
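
For readers unfamiliar with the Nyström self-attention mentioned in the FALFormer abstract, the standard approximation (as formulated in Nyströmformer) is sketched below. The symbols are assumptions introduced here, not taken from the abstract: $Q, K, V \in \mathbb{R}^{n \times d}$ are the queries, keys, and values for $n$ patches, $\tilde{Q}, \tilde{K} \in \mathbb{R}^{m \times d}$ are $m \ll n$ landmark queries and keys, and $[\cdot]^{+}$ is the Moore-Penrose pseudoinverse. The abstract does not specify how FALFormer's feature-aware landmarks are constructed, so this shows only the generic form.

```latex
\[
\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V
\;\approx\;
\operatorname{softmax}\!\left(\frac{Q\tilde{K}^{\top}}{\sqrt{d}}\right)
\left[\operatorname{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}}\right)\right]^{+}
\operatorname{softmax}\!\left(\frac{\tilde{Q}K^{\top}}{\sqrt{d}}\right)V
\]
```

The three factors have shapes $n \times m$, $m \times m$, and $m \times n$, so the full $n \times n$ attention matrix is never formed and the cost grows linearly in $n$ for a fixed number of landmarks $m$, rather than quadratically.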

CLEAR: Cross-Transformers with Pre-trained Language Model is All you need for Person Attribute Recognition and Retrieval

Mar 10, 2024

Doanh C. Bui, Thinh V. Le, Hung Ba Ngo, Tae Jong Choi

Abstract: Person attribute recognition and attribute-based retrieval are two core human-centric tasks. In the recognition task, the challenge is specifying attributes depending on a person's appearance, while the retrieval task involves searching for matching persons based on attribute queries. There is a significant relationship between recognition and retrieval tasks. In this study, we demonstrate that if there is a sufficiently robust network to solve person attribute recognition, it can be adapted to facilitate better performance for the retrieval task. Another issue that needs addressing in the retrieval task is the modality gap between attribute queries and persons' images. Therefore, in this paper, we present CLEAR, a unified network designed to address both tasks. We introduce a robust cross-transformers network to handle person attribute recognition. Additionally, leveraging a pre-trained language model, we construct pseudo-descriptions for attribute queries and introduce an effective training strategy to train only a few additional parameters for adapters, facilitating the handling of the retrieval task. Finally, the unified CLEAR model is evaluated on five benchmarks: PETA, PA100K, Market-1501, RAPv2, and UPAR-2024. Without bells and whistles, CLEAR achieves state-of-the-art performance or competitive results for both tasks, significantly outperforming other competitors in terms of person retrieval performance on the widely-used Market-1501 dataset.

UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese

May 09, 2023

Doanh C. Bui, Nghia Hieu Nguyen, Khang Nguyen

Abstract: Image captioning is a vision-language task that continues to interest the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, although it was published in 2015. Recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they perform less well on scenes captured in Vietnam and do not caption images fluently in Vietnamese. To support low-resource research communities such as Vietnam's, we introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we describe the dataset creation process in detail. A preliminary analysis shows that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that performed well on the MS-COCO dataset. These modest results indicate that UIT-OpenViIC leaves room for improvement and can serve as a standard Vietnamese benchmark for the research community to evaluate captioning models. Furthermore, we present CAMO, an approach that enhances image representation through a multi-level encoder output fusion mechanism, which helps improve the quality of generated captions compared to previous captioning models.

* 10 pages, 7 figures, submitted to Elsevier
