There is an increasing interest in developing LLMs for medical diagnosis to improve diagnosis efficiency. Despite their alluring technological potential, there is no unified and comprehensive evaluation criterion, leading to the inability to evaluate the quality and potential risks of medical LLMs, further hindering the application of LLMs in medical treatment scenarios. Besides, current evaluations heavily rely on labor-intensive interactions with LLMs to obtain diagnostic dialogues and human evaluation on the quality of diagnosis dialogue. To tackle the lack of unified and comprehensive evaluation criterion, we first initially establish an evaluation criterion, termed LLM-specific Mini-CEX to assess the diagnostic capabilities of LLMs effectively, based on original Mini-CEX. To address the labor-intensive interaction problem, we develop a patient simulator to engage in automatic conversations with LLMs, and utilize ChatGPT for evaluating diagnosis dialogues automatically. Experimental results show that the LLM-specific Mini-CEX is adequate and necessary to evaluate medical diagnosis dialogue. Besides, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible and automated comparisons between different LLMs.
Continual pretraining is a standard way of building a domain-specific pretrained language model from a general-domain language model. However, sequential task training may cause catastrophic forgetting, which affects the model performance in downstream tasks. In this paper, we propose a continual pretraining method for the BERT-based model, named CBEAF-Adapting (Chinese Biomedical Enhanced Attention-FFN Adapting). Its main idea is to introduce a small number of attention heads and hidden units inside each self-attention layer and feed-forward network. Using the Chinese biomedical domain as a running example, we trained a domain-specific language model named CBEAF-RoBERTa. We conduct experiments by applying models to downstream tasks. The results demonstrate that with only about 3% of model parameters trained, our method could achieve about 0.5%, 2% average performance gain compared to the best performing model in baseline and the domain-specific model, PCL-MedBERT, respectively. We also examine the forgetting problem of different pretraining methods. Our method alleviates the problem by about 13% compared to fine-tuning.
Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, the existing studies are mainly to add prompts to the input sequence. This way would not work as expected due to the intermediate multi-head self-attention and feed-forward network computation, making model optimization not very smooth. Hence, we propose a novel tuning way called layer tuning, aiming to add learnable parameters in Transformer layers. Specifically, we focus on layer tuning for feed-forward network in the Transformer, namely FL-tuning. It introduces additional units into the hidden layer of each feed-forward network. We conduct extensive experiments on the public CLUE benchmark. The results show that: 1) Our FL-tuning outperforms prompt tuning methods under both full-data and few-shot settings in almost all cases. In particular, it improves accuracy by 17.93% (full-data setting) on WSC 1.0 and F1 by 16.142% (few-shot setting) on CLUENER over P-tuning v2. 2) Our FL-tuning is more stable and converges about 1.17 times faster than P-tuning v2. 3) With only about 3% of Transformer's parameters to be trained, FL-tuning is comparable with fine-tuning on most datasets, and significantly outperforms fine-tuning (e.g., accuracy improved by 12.9% on WSC 1.1) on several datasets. The source codes are available at https://github.com/genggui001/FL-Tuning.
The medical dialogue system is a promising application that can provide great convenience for patients. The dialogue state tracking (DST) module in the medical dialogue system which interprets utterances into the machine-readable structure for downstream tasks is particularly challenging. Firstly, the states need to be able to represent compound entities such as symptoms with their body part or diseases with degrees of severity to provide enough information for decision support. Secondly, these named entities in the utterance might be discontinuous and scattered across sentences and speakers. These also make it difficult to annotate a large corpus which is essential for most methods. Therefore, we first define a multi-hierarchical state structure. We annotate and publish a medical dialogue dataset in Chinese. To the best of our knowledge, there are no publicly available ones before. Then we propose a Prompt-based Generative Approach which can generate slot values with multi-hierarchies incrementally using a top-down approach. A dialogue style prompt is also supplemented to utilize the large unlabeled dialogue corpus to alleviate the data scarcity problem. The experiments show that our approach outperforms other DST methods and is rather effective in the scenario with little data.
Medical terminology normalization aims to map the clinical mention to terminologies come from a knowledge base, which plays an important role in analyzing Electronic Health Record(EHR) and many downstream tasks. In this paper, we focus on Chinese procedure terminology normalization. The expression of terminologies are various and one medical mention may be linked to multiple terminologies. Previous study explores some methods such as multi-class classification or learning to rank(LTR) to sort the terminologies by literature and semantic information. However, these information is inadequate to find the right terminologies, particularly in multi-implication cases. In this work, we propose a combined recall and rank framework to solve the above problems. This framework is composed of a multi-task candidate generator(MTCG), a keywords attentive ranker(KAR) and a fusion block(FB). MTCG is utilized to predict the mention implication number and recall candidates with semantic similarity. KAR is based on Bert with a keywords attentive mechanism which focuses on keywords such as procedure sites and procedure types. FB merges the similarity come from MTCG and KAR to sort the terminologies from different perspectives. Detailed experimental analysis shows our proposed framework has a remarkable improvement on both performance and efficiency.
Entity and relation extraction is the necessary step in structuring medical text. However, the feature extraction ability of the bidirectional long short term memory network in the existing model does not achieve the best effect. At the same time, the language model has achieved excellent results in more and more natural language processing tasks. In this paper, we present a focused attention model for the joint entity and relation extraction task. Our model integrates well-known BERT language model into joint learning through dynamic range attention mechanism, thus improving the feature representation ability of shared parameter layer. Experimental results on coronary angiography texts collected from Shuguang Hospital show that the F1-score of named entity recognition and relation classification tasks reach 96.89% and 88.51%, which are better than state-of-the-art methods 1.65% and 1.22%, respectively.
Clinical text structuring is a critical and fundamental task for clinical research. Traditional methods such as taskspecific end-to-end models and pipeline models usually suffer from the lack of dataset and error propagation. In this paper, we present a question answering based clinical text structuring (QA-CTS) task to unify different specific tasks and make dataset shareable. A novel model that aims to introduce domain-specific features (e.g., clinical named entity information) into pre-trained language model is also proposed for QA-CTS task. Experimental results on Chinese pathology reports collected from Ruijing Hospital demonstrate our presented QA-CTS task is very effective to improve the performance on specific tasks. Our proposed model also competes favorably with strong baseline models in specific tasks.
Medical activities, such as diagnoses, medicine treatments, and laboratory tests, as well as temporal relations between these activities are the basic concepts in clinical research. However, existing relational data model on electronic medical records (EMRs) lacks explicit and accurate semantic definitions of these concepts. It leads to the inconvenience of query construction and the inefficiency of query execution where multi-table join queries are frequently required. In this paper, we propose a patient event graph (PatientEG) model to capture the characteristics of EMRs. We respectively define five types of medical entities, five types of medical events and five types of temporal relations. Based on the proposed model, we also construct a PatientEG dataset with 191,294 events, 3,429 distinct entities, and 545,993 temporal relations using EMRs from Shanghai Shuguang hospital. To help to normalize entity values which contain synonyms, hyponymies, and abbreviations, we link them with the Chinese biomedical knowledge graph. With the help of PatientEG dataset, we are able to conveniently perform complex queries for clinical research such as auxiliary diagnosis and therapeutic effectiveness analysis. In addition, we provide a SPARQL endpoint to access PatientEG dataset and the dataset is also publicly available online. Also, we list several illustrative SPARQL queries on our website.
Clinical Named Entity Recognition (CNER) aims to identify and classify clinical terms such as diseases, symptoms, treatments, exams, and body parts in electronic health records, which is a fundamental and crucial task for clinical and translation research. In recent years, deep learning methods have achieved significant success in CNER tasks. However, these methods depend greatly on Recurrent Neural Networks (RNNs), which maintain a vector of hidden activations that are propagated through time, thus causing too much time to train models. In this paper, we propose a Residual Dilated Convolutional Neural Network with Conditional Random Field (RD-CNN-CRF) to solve the Chinese CNER. Specifically, Chinese characters and dictionary features are first projected into dense vector representations, then they are fed into the residual dilated convolutional neural network to capture contextual features. Finally, a conditional random field is employed to capture dependencies between neighboring tags. Computational results on the CCKS-2017 Task 2 benchmark dataset show that our proposed RD-CNN-CRF method competes favorably with state-of-the-art RNN-based methods both in terms of computational performance and training time.
Named entities which composed of multiple continuous words frequently occur in domain-specific knowledge graphs. These entities are usually composable and extensible. Typical examples are names of symptoms and diseases in medical areas. To distinguish these entities from general entities, we name them compound entities. Hypernymy detection between compound entities plays an important role in domain-specific knowledge graph construction. Traditional hypernymy detection approaches cannot perform well on compound entities due to the lack of contextual information in texts, and even the absence of compound entities in training sets, i.e. Out-Of-Vocabulary (OOV) problem. In this paper, we present a novel attention-based Bi-GRU-CapsNet model to detect hypernymy relationship between compound entities. Our model consists of several important components. To avoid the OOV problem, English words or Chinese characters in compound entities are fed into Bidirectional Gated Recurrent Units (Bi-GRUs). An attention mechanism is designed to focus on the differences between two compound entities. Since there are some different cases in hypernymy relationship between compound entities, Capsule Network (CapsNet) is finally employed to decide whether the hypernymy relationship exists or not. Experimental results demonstrate the advantages of our model over the state-of-the-art methods both on English and Chinese corpora of symptom and disease pairs.