Abstract: We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios: single patch images, automatically keyframe-extracted clips, and manually segmented pathology videos, mimicking the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.
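The abstract above mentions automatically keyframe-extracted clips as one of the three image scenarios. The following is a minimal sketch of one plausible way such keyframe extraction could be implemented, using inter-frame histogram differences; the function name, threshold, and frame budget are illustrative assumptions and are not taken from the VideoPath-LLaVA release.

```python
# Hypothetical sketch of automatic keyframe extraction (not the authors' code):
# keep a frame when its color histogram differs markedly from the last kept frame.
import cv2


def extract_keyframes(video_path: str, diff_threshold: float = 0.3, max_frames: int = 32):
    """Return up to max_frames BGR frames whose content shifts noticeably."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while len(keyframes) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8 color histogram, normalized so the comparison is scale-free.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
            keyframes.append(frame)
            prev_hist = hist
    cap.release()
    return keyframes
```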
Abstract: In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image, and then incorporate momentum- and attention-based knowledge distillation to improve performance on both tasks. For training, we present an action dictionary-guided design, which consistently yields the most favorable results across our experiments. For VQA, we leverage object-level features and deploy co-attention networks to jointly model object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve the $2^{nd}$ rank in the action recognition and anticipation tasks and the $1^{st}$ rank in the VQA task.
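The abstract above describes a pre-processing strategy that samples multiple inputs and stitches them into a single image. Below is a minimal sketch of that idea under assumed settings (uniform temporal sampling, a 2x2 grid, 224x224 tiles); the grid layout and resolution are illustrative, not the values used in the challenge submission.

```python
# Hypothetical sketch of the "sample and stitch" pre-processing (not the authors' code):
# uniformly sample rows*cols frames from a clip and tile them row-major into one image.
import cv2
import numpy as np


def stitch_frames(frames, grid=(2, 2), tile_size=(224, 224)):
    """Build a single stitched image from a list of BGR frames."""
    rows, cols = grid
    # Pick evenly spaced frame indices across the clip.
    idx = np.linspace(0, len(frames) - 1, rows * cols).astype(int)
    tiles = [cv2.resize(frames[i], tile_size) for i in idx]
    # Assemble the grid: horizontal stacks per row, then stack rows vertically.
    row_imgs = [np.hstack(tiles[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(row_imgs)
```

A stitched image of this kind can be fed to a standard image backbone, which is one way a single-image model can be reused for clip-level recognition and anticipation.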