Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Weidi Xie

ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

Feb 21, 2025

Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman

Abstract:The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP/SigLIP and the state-of-the-art BLIP-2 architectures. To train the architecture with limited computing resources, we develop a 'student friendly' best practice involving global hard sample mining, and selection and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. Benefiting from the novel architecture and data curation, experiments show our enhanced network significantly boosts CLIP/SigLIP performance and outperforms the state-of-the-art BLIP-2 model on text-to-image retrieval.

Via

Access Paper or Ask Questions

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Feb 06, 2025

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie

Figure 1 for WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Figure 2 for WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Figure 3 for WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Figure 4 for WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Abstract:In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality, we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks, WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover the broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable the comprehensive evaluation; (iii) high-quality annotations, all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope our WorldSense can provide a platform for evaluating the ability in constructing and understanding coherent contexts from omni-modality.

Via

Access Paper or Ask Questions

Track-On: Transformer-based Online Point Tracking with Memory

Jan 30, 2025

Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney

Figure 1 for Track-On: Transformer-based Online Point Tracking with Memory

Figure 2 for Track-On: Transformer-based Online Point Tracking with Memory

Figure 3 for Track-On: Transformer-based Online Point Tracking with Memory

Figure 4 for Track-On: Transformer-based Online Point Tracking with Memory

Abstract:In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on

* ICLR 2025

Via

Access Paper or Ask Questions

A Knowledge-enhanced Pathology Vision-language Foundation Model for Cancer Diagnosis

Dec 17, 2024

Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ruifen Wang, Lifeng Wang, Xin Sun, Kun Sun, Ya Zhang, Yanfeng Wang(+1 more)

Abstract:Deep learning has enabled the development of highly robust foundation models for various pathological tasks across diverse diseases and patient cohorts. Among these models, vision-language pre-training, which leverages large-scale paired data to align pathology image and text embedding spaces, and provides a novel zero-shot paradigm for downstream tasks. However, existing models have been primarily data-driven and lack the incorporation of domain-specific knowledge, which limits their performance in cancer diagnosis, especially for rare tumor subtypes. To address this limitation, we establish a Knowledge-enhanced Pathology (KEEP) foundation model that harnesses disease knowledge to facilitate vision-language pre-training. Specifically, we first construct a disease knowledge graph (KG) that covers 11,454 human diseases with 139,143 disease attributes, including synonyms, definitions, and hypernym relations. We then systematically reorganize the millions of publicly available noisy pathology image-text pairs, into 143K well-structured semantic groups linked through the hierarchical relations of the disease KG. To derive more nuanced image and text representations, we propose a novel knowledge-enhanced vision-language pre-training approach that integrates disease knowledge into the alignment within hierarchical semantic groups instead of unstructured image-text pairs. Validated on 18 diverse benchmarks with more than 14,000 whole slide images (WSIs), KEEP achieves state-of-the-art performance in zero-shot cancer diagnostic tasks. Notably, for cancer detection, KEEP demonstrates an average sensitivity of 89.8% at a specificity of 95.0% across 7 cancer types. For cancer subtyping, KEEP achieves a median balanced accuracy of 0.456 in subtyping 30 rare brain cancers, indicating strong generalizability for diagnosing rare tumors.

Via

Access Paper or Ask Questions

Can Modern LLMs Act as Agent Cores in Radiology~Environments?

Dec 12, 2024

Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie

Abstract:Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper aims to investigate the pre-requisite question for building concrete radiology agents which is, `Can modern LLMs act as agent cores in radiology environments?' To investigate it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems in real-world radiology practices effectively. All of our code and data are open-sourced in https://github.com/MAGIC-AI4Med/RadABench.

* 22 pages,7 figures

Via

Access Paper or Ask Questions

MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Dec 04, 2024

Haoning Wu, Ziheng Zhao, Ya Zhang, Weidi Xie, Yanfeng Wang

Figure 1 for MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Figure 2 for MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Figure 3 for MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Figure 4 for MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Abstract:Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: (i) we collect and curate a large-scale radiology image-text dataset, MedGen-1M, comprising modality labels, attributes, region, and organ information, along with a subset of organ mask annotations, to support research in controllable medical image generation; (ii) we propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks, synthesizing MR images for diverse modalities lacking mask annotations, to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, illustrating that our data engine can effectively synthesize training samples and extend MRI segmentation towards unannotated modalities.

* Technical Report; Project Page: https://haoningwu3639.github.io/MRGen/

Via

Access Paper or Ask Questions

Unlocking Video-LLM via Agent-of-Thoughts Distillation

Dec 02, 2024

Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie

Figure 1 for Unlocking Video-LLM via Agent-of-Thoughts Distillation

Figure 2 for Unlocking Video-LLM via Agent-of-Thoughts Distillation

Figure 3 for Unlocking Video-LLM via Agent-of-Thoughts Distillation

Figure 4 for Unlocking Video-LLM via Agent-of-Thoughts Distillation

Abstract:This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks, and address them with specialized vision models, the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.

Via

Access Paper or Ask Questions

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Dec 02, 2024

Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie

Figure 1 for LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Figure 2 for LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Figure 3 for LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Figure 4 for LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Abstract:With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.

Via

Access Paper or Ask Questions

LoRKD: Low-Rank Knowledge Decomposition for Medical Foundation Models

Sep 29, 2024

Haolin Li, Yuhang Zhou, Ziheng Zhao, Siyuan Du, Jiangchao Yao, Weidi Xie, Ya Zhang, Yanfeng Wang

Figure 1 for LoRKD: Low-Rank Knowledge Decomposition for Medical Foundation Models

Figure 2 for LoRKD: Low-Rank Knowledge Decomposition for Medical Foundation Models

Figure 3 for LoRKD: Low-Rank Knowledge Decomposition for Medical Foundation Models

Figure 4 for LoRKD: Low-Rank Knowledge Decomposition for Medical Foundation Models

Abstract:The widespread adoption of large-scale pre-training techniques has significantly advanced the development of medical foundation models, enabling them to serve as versatile tools across a broad range of medical tasks. However, despite their strong generalization capabilities, medical foundation models pre-trained on large-scale datasets tend to suffer from domain gaps between heterogeneous data, leading to suboptimal performance on specific tasks compared to specialist models, as evidenced by previous studies. In this paper, we explore a new perspective called "Knowledge Decomposition" to improve the performance on specific medical tasks, which deconstructs the foundation model into multiple lightweight expert models, each dedicated to a particular anatomical region, with the aim of enhancing specialization and simultaneously reducing resource consumption. To accomplish the above objective, we propose a novel framework named Low-Rank Knowledge Decomposition (LoRKD), which explicitly separates gradients from different tasks by incorporating low-rank expert modules and efficient knowledge separation convolution. The low-rank expert modules resolve gradient conflicts between heterogeneous data from different anatomical regions, providing strong specialization at lower costs. The efficient knowledge separation convolution significantly improves algorithm efficiency by achieving knowledge separation within a single forward propagation. Extensive experimental results on segmentation and classification tasks demonstrate that our decomposed models not only achieve state-of-the-art performance but also exhibit superior transferability on downstream tasks, even surpassing the original foundation models in task-specific evaluations. The code is available at here.

* The paper is an extended version of our conference paper published on CVPR 2024. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Aug 26, 2024

Qirui Chen, Shangzhe Di, Weidi Xie

Figure 1 for Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Figure 2 for Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Figure 3 for Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Figure 4 for Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Abstract:This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.

Via

Access Paper or Ask Questions