Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhi Liu

Department of Mathematical and Systems Engineering, Shizuoka University, Japan

Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Jan 06, 2025

Hongyi Miao, Jun Jia, Yankun Cao, Yingjie Zhou, Yanwei Jiang, Zhi Liu, Guangtao Zhai

Figure 1 for Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Figure 2 for Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Figure 3 for Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Figure 4 for Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Abstract:With the dramatic upsurge in the volume of ultrasound examinations, low-quality ultrasound imaging has gradually increased due to variations in operator proficiency and imaging circumstances, imposing a severe burden on diagnosis accuracy and even entailing the risk of restarting the diagnosis in critical cases. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on quality assessment tasks of ultrasound images. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images encompassing common ultrasound imaging artifacts are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensionalities: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs as well as 1 proprietary MLLMs demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.

Via

Access Paper or Ask Questions

Evaluating the Design Features of an Intelligent Tutoring System for Advanced Mathematics Learning

Dec 23, 2024

Ying Fang, Bo He, Zhi Liu, Sannyuya Liu, Zhonghua Yan, Jianwen Sun

Abstract:Xiaomai is an intelligent tutoring system (ITS) designed to help Chinese college students in learning advanced mathematics and preparing for the graduate school math entrance exam. This study investigates two distinctive features within Xiaomai: the incorporation of free-response questions with automatic feedback and the metacognitive element of reflecting on self-made errors.

Via

Access Paper or Ask Questions

SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation

Dec 20, 2024

Ke Yan, Qing Cai, Fan Zhang, Ziyan Cao, Zhi Liu

Figure 1 for SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation

Figure 2 for SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation

Figure 3 for SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation

Figure 4 for SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation

Abstract:Although semi-supervised learning has made significant advances in the field of medical image segmentation, fully annotating a volumetric sample slice by slice remains a costly and time-consuming task. Even worse, most of the existing approaches pay much attention to image-level information and ignore semantic features, resulting in the inability to perceive weak boundaries. To address these issues, we propose a novel Semantic-Guided Triplet Co-training (SGTC) framework, which achieves high-end medical image segmentation by only annotating three orthogonal slices of a few volumetric samples, significantly alleviating the burden of radiologists. Our method consist of two main components. Specifically, to enable semantic-aware, fine-granular segmentation and enhance the quality of pseudo-labels, a novel semantic-guided auxiliary learning mechanism is proposed based on the pretrained CLIP. In addition, focusing on a more challenging but clinically realistic scenario, a new triple-view disparity training strategy is proposed, which uses sparse annotations (i.e., only three labeled slices of a few volumes) to perform co-training between three sub-networks, significantly improving the robustness. Extensive experiments on three public medical datasets demonstrate that our method outperforms most state-of-the-art semi-supervised counterparts under sparse annotation settings. The source code is available at https://github.com/xmeimeimei/SGTC.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Nov 28, 2024

ShuaiHeng Li, Qing Cai, Fan Zhang, Menghuan Zhang, Yangyang Shu, Zhi Liu, Huafeng Li, Lingqiao Liu

Figure 1 for PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Figure 2 for PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Figure 3 for PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Figure 4 for PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Abstract:Self-supervised learning is emerging in fine-grained visual recognition with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in fine-grained visual recognition (FGVR), resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it composes of two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM reveals that extracting GradCAM from the original image effectively reveals subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that the PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.

Via

Access Paper or Ask Questions

Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Nov 23, 2024

Zhiying Li, Zhi Liu, Guanggang Geng, Shreyank N Gowda, Shuyuan Lin, Jian Weng, Xiaobo Jin

Figure 1 for Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Figure 2 for Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Figure 3 for Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Figure 4 for Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Abstract:Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.

* 13 pages, 8 figures

Via

Access Paper or Ask Questions

CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification

Oct 12, 2024

Qianru Han, Xinwei He, Zhi Liu, Sannyuya Liu, Ying Zhang, Jinhai Xiang

Abstract:Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP). However, the absence of concrete descriptions necessitates the use of implicit text embeddings, which demand complicated and inefficient training strategies. To address this issue, we first propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images, and thereby boost person re-identification with large vision language models. Using models like the Large Language and Vision Assistant (LLAVA), we generate high-quality captions based on fixed templates that capture key semantic attributes such as gender, clothing, and age. By augmenting ReID training sets from uni-modality (image) to bi-modality (image and text), we introduce CLIP-SCGI, a simple yet effective framework that leverages synthesized captions to guide the learning of discriminative and robust representations. Built on CLIP, CLIP-SCGI fuses image and text embeddings through two modules to enhance the training process. To address quality issues in generated captions, we introduce a caption-guided inversion module that captures semantic attributes from images by converting relevant visual information into pseudo-word tokens based on the descriptions. This approach helps the model better capture key information and focus on relevant regions. The extracted features are then utilized in a cross-modal fusion module, guiding the model to focus on regions semantically consistent with the caption, thereby facilitating the optimization of the visual encoder to extract discriminative and robust representations. Extensive experiments on four popular ReID benchmarks demonstrate that CLIP-SCGI outperforms the state-of-the-art by a significant margin.

Via

Access Paper or Ask Questions

UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

Sep 27, 2024

Chuang Chen, Xiao Sun, Zhi Liu

Abstract:Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.

* Submitted to TIP

Via

Access Paper or Ask Questions

Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

Sep 18, 2024

Cuiwei Liu, Siang Xu, Huaijun Qiu, Jing Zhang, Zhi Liu, Liang Zhao

Figure 1 for Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

Figure 2 for Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

Figure 3 for Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

Figure 4 for Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

Abstract:Few-shot class-incremental learning is crucial for developing scalable and adaptive intelligent systems, as it enables models to acquire new classes with minimal annotated data while safeguarding the previously accumulated knowledge. Nonetheless, existing methods deal with continuous data streams in a centralized manner, limiting their applicability in scenarios that prioritize data privacy and security. To this end, this paper introduces federated few-shot class-incremental learning, a decentralized machine learning paradigm tailored to progressively learn new classes from scarce data distributed across multiple clients. In this learning paradigm, clients locally update their models with new classes while preserving data privacy, and then transmit the model updates to a central server where they are aggregated globally. However, this paradigm faces several issues, such as difficulties in few-shot learning, catastrophic forgetting, and data heterogeneity. To address these challenges, we present a synthetic data-driven framework that leverages replay buffer data to maintain existing knowledge and facilitate the acquisition of new knowledge. Within this framework, a noise-aware generative replay module is developed to fine-tune local models with a balance of new and replay data, while generating synthetic data of new classes to further expand the replay buffer for future tasks. Furthermore, a class-specific weighted aggregation strategy is designed to tackle data heterogeneity by adaptively aggregating class-specific parameters based on local models performance on synthetic data. This enables effective global model optimization without direct access to client data. Comprehensive experiments across three widely-used datasets underscore the effectiveness and preeminence of the introduced framework.

Via

Access Paper or Ask Questions

Optimizing Federated Graph Learning with Inherent Structural Knowledge and Dual-Densely Connected GNNs

Aug 21, 2024

Longwen Wang, Jianchun Liu, Zhi Liu, Jinyang Huang

Abstract:Federated Graph Learning (FGL) is an emerging technology that enables clients to collaboratively train powerful Graph Neural Networks (GNNs) in a distributed manner without exposing their private data. Nevertheless, FGL still faces the challenge of the severe non-Independent and Identically Distributed (non-IID) nature of graphs, which possess diverse node and edge structures, especially across varied domains. Thus, exploring the knowledge inherent in these structures becomes significantly crucial. Existing methods, however, either overlook the inherent structural knowledge in graph data or capture it at the cost of significantly increased resource demands (e.g., FLOPs and communication bandwidth), which can be detrimental to distributed paradigms. Inspired by this, we propose FedDense, a novel FGL framework that optimizes the utilization efficiency of inherent structural knowledge. To better acquire knowledge of diverse and underexploited structures, FedDense first explicitly encodes the structural knowledge inherent within graph data itself alongside node features. Besides, FedDense introduces a Dual-Densely Connected (DDC) GNN architecture that exploits the multi-scale (i.e., one-hop to multi-hop) feature and structure insights embedded in the aggregated feature maps at each layer. In addition to the exploitation of inherent structures, we consider resource limitations in FGL, devising exceedingly narrow layers atop the DDC architecture and adopting a selective parameter sharing strategy to reduce resource costs substantially. We conduct extensive experiments using 15 datasets across 4 different domains, demonstrating that FedDense consistently surpasses baselines by a large margin in training performance, while demanding minimal resources.

Via

Access Paper or Ask Questions

FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Aug 07, 2024

Ruiqi Wang, Jinyang Huang, Jie Zhang, Xin Liu, Xiang Zhang, Zhi Liu, Peng Zhao, Sigui Chen, Xiao Sun

Figure 1 for FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Figure 2 for FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Figure 3 for FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Figure 4 for FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks

Abstract:Depression is a prevalent mental health disorder that significantly impacts individuals' lives and well-being. Early detection and intervention are crucial for effective treatment and management of depression. Recently, there are many end-to-end deep learning methods leveraging the facial expression features for automatic depression detection. However, most current methods overlook the temporal dynamics of facial expressions. Although very recent 3DCNN methods remedy this gap, they introduce more computational cost due to the selection of CNN-based backbones and redundant facial features. To address the above limitations, by considering the timing correlation of facial expressions, we propose a novel framework called FacialPulse, which recognizes depression with high accuracy and speed. By harnessing the bidirectional nature and proficiently addressing long-term dependencies, the Facial Motion Modeling Module (FMMM) is designed in FacialPulse to fully capture temporal features. Since the proposed FMMM has parallel processing capabilities and has the gate mechanism to mitigate gradient vanishing, this module can also significantly boost the training speed. Besides, to effectively use facial landmarks to replace original images to decrease information redundancy, a Facial Landmark Calibration Module (FLCM) is designed to eliminate facial landmark errors to further improve recognition accuracy. Extensive experiments on the AVEC2014 dataset and MMDA dataset (a depression dataset) demonstrate the superiority of FacialPulse on recognition accuracy and speed, with the average MAE (Mean Absolute Error) decreased by 21% compared to baselines, and the recognition speed increased by 100% compared to state-of-the-art methods. Codes are released at https://github.com/volatileee/FacialPulse.

Via

Access Paper or Ask Questions