Bevan Koopman

ChatGPT Hallucinates when Attributing Answers

Sep 17, 2023
Guido Zuccon, Bevan Koopman, Razia Shaik

Can ChatGPT provide evidence to support its answers? Does the evidence it suggests actually exist, and does it really support its answer? We investigate these questions using a collection of domain-specific knowledge-based questions, prompting ChatGPT to provide both an answer and supporting evidence in the form of references to external sources. We also investigate how different prompts impact answers and evidence. We find that ChatGPT provides correct or partially correct answers in about half of the cases (50.6%), but its suggested references exist only 14% of the time. We further provide insights into the generated references, revealing common traits among them, and show that even when a reference provided by the model does exist, it often does not support the claims ChatGPT attributes to it. Our findings are important because (1) they are the first systematic analysis of the references created by ChatGPT in its answers, and (2) they suggest that the model may leverage good-quality information in producing correct answers, but is unable to attribute real evidence to support those answers. Prompts, raw result files and manual analysis are made publicly available.
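As a rough illustration of this kind of experimental setup, the sketch below prompts ChatGPT for an answer plus supporting references and extracts the reference list for manual verification; the prompt wording, model name and parsing are assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: ask ChatGPT for an answer plus supporting references,
# then pull out the reference list for manual verification. The prompt wording,
# model name and parsing below are illustrative, not the paper's exact protocol.
from openai import OpenAI  # assumes the official openai client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_references(question: str) -> tuple[str, list[str]]:
    prompt = (
        f"Question: {question}\n"
        "Answer the question, then list the references (title, authors, venue, "
        "year) that support your answer under a heading 'References:'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content or ""
    answer, _, refs_block = text.partition("References:")
    references = [line.strip("- ").strip() for line in refs_block.splitlines() if line.strip()]
    return answer.strip(), references

# Each returned reference still has to be checked against bibliographic
# databases to see whether it exists and whether it supports the answer.
```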

Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation

Sep 11, 2023
Shuai Wang, Harrisen Scells, Martin Potthast, Bevan Koopman, Guido Zuccon

Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. The goal is to prioritise the most important documents so that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review to rank documents with BERT-based neural rankers. However, the final title is only formulated at the end of the review process, which makes this approach impractical as it relies on ex post facto information. At the time of screening, only a rough working title is available, with which the BERT-based ranker is significantly less effective than with the final title. In this paper, we explore alternative sources of queries for screening prioritisation, such as the Boolean query used to retrieve the set of documents to be screened, and queries generated by instruction-based generative large language models such as ChatGPT and Alpaca. Our best approach is not only practical based on the information available at screening time, but is also similar in effectiveness to using the final title.
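The sketch below frames screening prioritisation as query-document re-ranking; a generic cross-encoder from sentence-transformers stands in for the paper's BERT-based rankers, and the query might be the rough working title, the Boolean query rendered as text, or an LLM-generated natural language query.

```python
# Illustrative sketch of screening prioritisation as query-document re-ranking.
# A generic cross-encoder stands in for the paper's BERT-based rankers.
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def prioritise(query: str, documents: list[str]) -> list[tuple[str, float]]:
    """Rank the unordered set of retrieved documents by estimated relevance."""
    scores = ranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]

# Assessors then screen documents from the top of this ranking downwards.
```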

* Preprint of accepted paper at SIGIR-AP 2023 

Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

Jul 19, 2023
Aaron Nicolson, Jason Dowling, Bevan Koopman

Chest X-Ray (CXR) report generation is a promising approach to improving the efficiency of CXR interpretation. However, a significant increase in diagnostic accuracy is required before that can be realised. Motivated by this, we propose a framework that is more in line with a radiologist's workflow by considering longitudinal data. Here, the decoder is additionally conditioned on the report from the subject's previous imaging study via a prompt. We also propose a new reward for reinforcement learning based on CXR-BERT, which computes the similarity between reports. We conduct experiments on the MIMIC-CXR dataset. The results indicate that longitudinal data improves CXR report generation. CXR-BERT is also shown to be a promising alternative to the current state-of-the-art reward based on RadGraph. This investigation indicates that longitudinal CXR report generation can offer a substantial increase in diagnostic accuracy. Our Hugging Face model is available at: https://huggingface.co/aehrc/cxrmate and code is available at: https://github.com/aehrc/cxrmate.
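A minimal sketch of a semantic-similarity reward of this kind is shown below; a generic sentence-embedding model stands in for CXR-BERT purely to illustrate the idea.

```python
# Minimal sketch of a semantic-similarity reward between a generated report and
# the reference report. The paper's reward is based on CXR-BERT; a generic
# sentence-embedding model stands in here purely to illustrate the idea.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def report_similarity_reward(generated: str, reference: str) -> float:
    """Cosine similarity between report embeddings, usable as an RL reward."""
    embeddings = encoder.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

reward = report_similarity_reward(
    "No acute cardiopulmonary abnormality.",
    "Lungs are clear. No acute findings.",
)
```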

Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Feb 23, 2023
Guido Zuccon, Bevan Koopman

Generative pre-trained language models (GPLMs) like ChatGPT encode, in their parameters, knowledge observed during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, GPLMs leverage the knowledge and linguistic patterns learned during training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompt. For example, a GPLM can be integrated into a retrieve-then-generate paradigm, where a search engine is used to retrieve documents relevant to the question; the content of the documents is then passed to the GPLM via the prompt. In this paper we study the differences in answer correctness when ChatGPT leverages the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and that, in our experiments, this is to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.
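The two prompting conditions contrasted here can be sketched as follows; the prompt wording is illustrative and not the exact templates used in the paper.

```python
# Sketch of the two prompting conditions compared: the model's own knowledge
# alone vs. knowledge supplied in the prompt (retrieve-then-generate). The
# prompt wording is illustrative, not the exact templates used in the paper.
def question_only_prompt(question: str) -> str:
    return f"Answer yes or no: {question}"

def evidence_biased_prompt(question: str, retrieved_passage: str) -> str:
    # The retrieved passage may support or contradict the correct answer; the
    # paper shows such prompt knowledge can overturn what the model would
    # otherwise say.
    return (
        f"Context: {retrieved_passage}\n\n"
        f"Based on the context above, answer yes or no: {question}"
    )
```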

Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search?

Feb 09, 2023
Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon

Systematic reviews are comprehensive reviews of the literature for a highly focused research question. These reviews are often treated as the highest form of evidence in evidence-based medicine and are the key strategy for answering research questions in the medical field. To create a high-quality systematic review, complex Boolean queries are often constructed to retrieve studies for the review topic. However, it often takes a long time for systematic review researchers to construct a high-quality Boolean query, and the resulting queries are often far from effective. Poor queries may lead to biased or invalid reviews, because they fail to retrieve key evidence, or to an extensive increase in review costs, because they retrieve too many irrelevant studies. Recent advances in Transformer-based generative models have shown great potential to follow instructions from users and generate answers accordingly. In this paper, we investigate the effectiveness of the latest such model, ChatGPT, in generating effective Boolean queries for systematic review literature search. Through extensive experiments on standard test collections for the task, we find that ChatGPT is capable of generating queries that lead to high search precision, although this comes at the cost of recall. Overall, our study demonstrates the potential of ChatGPT in generating effective Boolean queries for systematic review literature search. The ability of ChatGPT to follow complex instructions and generate queries with high precision makes it a valuable tool for researchers conducting systematic reviews, particularly for rapid reviews, where time is a constraint and trading lower recall for higher precision is often acceptable.
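As a sketch of how a generated query's effectiveness can be scored, the snippet below computes precision and recall of a query's result set against the studies included in the review; the identifiers and the retrieval step itself are hypothetical.

```python
# Hypothetical sketch: once a (ChatGPT-generated) Boolean query has been run
# against a bibliographic database, its result set can be scored against the
# studies actually included in the review. The identifiers below are made up.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# High precision traded off against recall, the pattern reported for
# ChatGPT-generated queries:
p, r = precision_recall({"pmid1", "pmid2"}, {"pmid1", "pmid3", "pmid4"})
print(p, r)  # 0.5 precision, ~0.33 recall
```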

AgAsk: An Agent to Help Answer Farmer's Questions From Scientific Documents

Dec 21, 2022
Bevan Koopman, Ahmed Mourad, Hang Li, Anton van der Vegt, Shengyao Zhuang, Simon Gibson, Yash Dang, David Lawrence, Guido Zuccon

Decisions in agriculture are increasingly data-driven; however, valuable agricultural knowledge is often locked away in free-text reports, manuals and journal articles. Specialised search systems are needed that can mine agricultural information to provide relevant answers to users' questions. This paper presents AgAsk -- an agent able to answer natural language agriculture questions by mining scientific documents. We carefully survey and analyse farmers' information needs. On the basis of these needs we release an information retrieval test collection comprising real questions, a large collection of scientific documents split into passages, and ground truth relevance assessments indicating which passages are relevant to each question. We implement and evaluate a number of information retrieval models to answer farmers' questions, including two state-of-the-art neural ranking models. We show that neural rankers are highly effective at matching passages to questions in this context. Finally, we propose a deployment architecture for AgAsk that includes a client based on the Telegram messaging platform and a retrieval model deployed on commodity hardware. The test collection we provide is intended to stimulate more research into methods for matching natural language questions to answers in scientific documents. While the retrieval models were evaluated in the agriculture domain, they are generalisable and of interest to others working on similar problems. The test collection is available at: https://github.com/ielab/agvaluate.
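A hypothetical sketch of how a retrieval run could be evaluated against the collection's relevance assessments, assuming TREC-format qrels and run files (the file names are illustrative) and the ir_measures package:

```python
# Hypothetical evaluation sketch against the AgAsk relevance assessments,
# assuming TREC-format qrels and run files; file names are illustrative.
import ir_measures
from ir_measures import nDCG, P

qrels = list(ir_measures.read_trec_qrels("agask.qrels"))
run = list(ir_measures.read_trec_run("neural_ranker.run"))
print(ir_measures.calc_aggregate([nDCG @ 10, P @ 1], qrels, run))
```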

* 17 pages, submitted to IJDL 

Neural Rankers for Effective Screening Prioritisation in Medical Systematic Review Literature Search

Dec 18, 2022
Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon

Medical systematic reviews typically require assessing all the documents retrieved by a search. The reason is two-fold: the task aims for "total recall"; and documents retrieved using Boolean search are an unordered set, so it is unclear how an assessor could examine only a subset. Screening prioritisation is the process of ranking the (unordered) set of retrieved documents, allowing assessors to begin the downstream processes of systematic review creation earlier, leading to earlier completion of the review, or even avoiding screening the documents ranked least relevant. Screening prioritisation requires highly effective ranking methods. Pre-trained language models are state-of-the-art on many IR tasks but have yet to be applied to systematic review screening prioritisation. In this paper, we apply several pre-trained language models to the systematic review document ranking task, both directly and fine-tuned. An empirical analysis compares the effectiveness of these neural methods to traditional methods for this task. We also investigate different types of document representations for neural methods and their impact on ranking performance. Our results show that BERT-based rankers outperform the current state-of-the-art screening prioritisation methods. However, BERT rankers and existing methods can actually be complementary, and thus further improvements may be achieved if they are used in conjunction.
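One simple way to exploit that complementarity is linear interpolation of a BERT ranker's scores with an existing method's scores; the sketch below is our own illustration, not a fusion method evaluated in the paper, and it assumes the two score scales are comparable.

```python
# Illustrative fusion of a BERT ranker with an existing screening-prioritisation
# method via linear score interpolation. Not a method from the paper; alpha and
# comparable score scales are assumptions.
def fuse(bert_scores: dict[str, float], baseline_scores: dict[str, float],
         alpha: float = 0.5) -> list[tuple[str, float]]:
    docs = set(bert_scores) | set(baseline_scores)
    fused = {
        doc: alpha * bert_scores.get(doc, 0.0) + (1 - alpha) * baseline_scores.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```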

Automated MeSH Term Suggestion for Effective Query Formulation in Systematic Reviews Literature Search

Sep 19, 2022
Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon

High-quality medical systematic reviews require comprehensive literature searches to ensure that the recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews and often involves domain experts (medical researchers) and search experts (information specialists) in developing the search queries. Queries in this context are highly complex, based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., the Medical Subject Headings (MeSH) thesaurus), and are difficult and time-consuming to build. The use of MeSH terms, in particular, has been shown to improve the quality of the search results. However, identifying the correct MeSH terms to include in a query is difficult: information specialists are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. As a result, the full value of the MeSH terminology is often not exploited. This article investigates methods to suggest MeSH terms based on an initial Boolean query that includes only free-text terms. To this end, we devise methods based on lexical matching and on pre-trained language models. These methods promise to automatically identify highly effective MeSH terms for inclusion in a systematic review query. Our study contributes an empirical evaluation of several MeSH term suggestion methods. We further contribute an extensive analysis of the MeSH terms suggested by each method and of how these suggestions impact the effectiveness of Boolean queries.
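As an illustration of the lexical flavour of suggestion method, the sketch below matches a query's free-text terms against MeSH entry terms; the miniature MeSH dictionary is a hypothetical stand-in for the full thesaurus.

```python
# Illustrative sketch of a purely lexical suggestion baseline: match a query's
# free-text terms against MeSH entry terms. The miniature dictionary below is a
# hypothetical stand-in for the full MeSH thesaurus.
MESH_ENTRY_TERMS = {
    "Diabetes Mellitus, Type 2": {"type 2 diabetes", "niddm", "adult-onset diabetes"},
    "Metformin": {"metformin", "glucophage"},
}

def suggest_mesh_terms(free_text_terms: list[str]) -> set[str]:
    suggestions = set()
    for term in free_text_terms:
        for heading, entry_terms in MESH_ENTRY_TERMS.items():
            if term.lower() in entry_terms:
                suggestions.add(heading)
    return suggestions

print(suggest_mesh_terms(["Type 2 Diabetes", "metformin"]))
# {'Diabetes Mellitus, Type 2', 'Metformin'}
```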

* This paper is currently under peer review for the Intelligent Systems with Applications journal, Technology-Assisted Review Systems special issue. arXiv admin note: text overlap with arXiv:2112.00277 

How does Feedback Signal Quality Impact Effectiveness of Pseudo Relevance Feedback for Passage Retrieval?

May 12, 2022
Hang Li, Ahmed Mourad, Bevan Koopman, Guido Zuccon

Pseudo-Relevance Feedback (PRF) assumes that the top results retrieved by a first-stage ranker are relevant to the original query and uses them to improve the query representation for a second round of retrieval. This assumption, however, is often not correct: some or even all of the feedback documents may be irrelevant. Indeed, the effectiveness of PRF methods may well depend on the quality of the feedback signal and thus on the effectiveness of the first-stage ranker. This aspect, however, has received little attention so far. In this paper we control the quality of the feedback signal and measure its impact on a range of PRF methods, including traditional bag-of-words methods (Rocchio) and dense vector-based methods (learnt and not learnt). Our results show the important role the quality of the feedback signal plays in the effectiveness of PRF methods. Importantly, and surprisingly, our analysis reveals that not all PRF methods behave the same when dealing with feedback signals of varying quality. These findings are critical to gain a better understanding of PRF methods, of which should be used and when, depending on the feedback signal quality, and set the basis for future research in this area.
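For concreteness, the sketch below shows two simple dense vector PRF variants of the kind studied: an unweighted average of the query and feedback document embeddings, and a Rocchio-style weighted combination; the parameter values are illustrative.

```python
# Minimal sketch of two dense vector PRF variants: an unweighted average of
# query and feedback document embeddings, and a Rocchio-style weighted update.
# Parameter values are illustrative; the paper varies which documents are fed
# back in order to control feedback-signal quality.
import numpy as np

def vector_prf_average(query_vec: np.ndarray, feedback_vecs: list[np.ndarray]) -> np.ndarray:
    """New query vector = mean of the original query and feedback document vectors."""
    return np.mean([query_vec, *feedback_vecs], axis=0)

def rocchio(query_vec: np.ndarray, feedback_vecs: list[np.ndarray],
            alpha: float = 1.0, beta: float = 0.75) -> np.ndarray:
    """Rocchio-style update: weight the original query against the feedback centroid."""
    return alpha * query_vec + beta * np.mean(feedback_vecs, axis=0)
```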

* Accepted at SIGIR 2022 