Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lu Yao

GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

Jan 13, 2026

Yan Zhu, Te Luo, Pei-Yao Fu, Zhen Zhang, Zi-Long Wang, Yi-Fan Qu, Zi-Han Geng, Jia-Qi Xu, Lu Yao, Li-Yun Ma(+5 more)

Abstract:Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. To systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features.GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.

* 45 pages, 17 figures, 6 tables. Leaderboard available at: https://roterdl.github.io/GIBench/ . Includes supplementary material

Via

Access Paper or Ask Questions

CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention

Jul 31, 2021

Wenxiao Wang, Lu Yao, Long Chen, Deng Cai, Xiaofei He, Wei Liu

Figure 1 for CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention

Figure 2 for CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention

Figure 3 for CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention

Figure 4 for CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention

Abstract:Transformers have made much progress in dealing with visual tasks. However, existing vision transformers still do not possess an ability that is important to visual input: building the attention among features of different scales. The reasons for this problem are two-fold: (1) Input embeddings of each layer are equal-scale without cross-scale features; (2) Some vision transformers sacrifice the small-scale features of embeddings to lower the cost of the self-attention module. To make up this defect, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). In particular, CEL blends each embedding with multiple patches of different scales, providing the model with cross-scale embeddings. LSDA splits the self-attention module into a short-distance and long-distance one, also lowering the cost but keeping both small-scale and large-scale features in embeddings. Through these two designs, we achieve cross-scale attention. Besides, we propose dynamic position bias for vision transformers to make the popular relative position bias apply to variable-sized images. Based on these proposed modules, we construct our vision architecture called CrossFormer. Experiments show that CrossFormer outperforms other transformers on several representative visual tasks, especially object detection and segmentation. The code has been released: https://github.com/cheerss/CrossFormer.

* 13 pages, 4 figures, and 9 tables

Via

Access Paper or Ask Questions

Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

May 03, 2021

Yunwen Chen, Zuotao Liu, Daqi Ji, Yingwei Xin, Wenguang Wang, Lu Yao, Yi Zou

Figure 1 for Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

Figure 2 for Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

Figure 3 for Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

Figure 4 for Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

Abstract:This paper describes the solution of Shanda Innovations team to Task 1 of KDD-Cup 2012. A novel approach called Multifaceted Factorization Models is proposed to incorporate a great variety of features in social networks. Social relationships and actions between users are integrated as implicit feedbacks to improve the recommendation accuracy. Keywords, tags, profiles, time and some other features are also utilized for modeling user interests. In addition, user behaviors are modeled from the durations of recommendation records. A context-aware ensemble framework is then applied to combine multiple predictors and produce final recommendation results. The proposed approach obtained 0.43959 (public score) / 0.41874 (private score) on the testing dataset, which achieved the 2nd place in the KDD-Cup competition.

* KDD 2012

Via

Access Paper or Ask Questions