Preslav Nakov

Mohamed bin Zayed University of Artificial Intelligence

QACHECK: A Demonstration System for Question-Guided Multi-Hop Fact-Checking

Oct 11, 2023

Liangming Pan, Xinyuan Lu, Min-Yen Kan, Preslav Nakov

Abstract:Fact-checking real-world claims often requires complex, multi-step reasoning due to the absence of direct evidence to support or refute them. However, existing fact-checking systems often lack transparency in their decision-making, making it challenging for users to comprehend their reasoning process. To address this, we propose the Question-guided Multi-hop Fact-Checking (QACHECK) system, which guides the model's reasoning process by asking a series of questions critical for verifying a claim. QACHECK has five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner. Users can input a claim into QACHECK, which then predicts its veracity and provides a comprehensive report detailing its reasoning process, guided by a sequence of (question, answer) pairs. QACHECK also provides the source of evidence supporting each question, fostering a transparent, explainable, and user-friendly fact-checking process. A recorded video of QACHECK is at https://www.youtube.com/watch?v=ju8kxSldM64

* Accepted at EMNLP 2023 System Demonstrations Track
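
The five modules named in the abstract can be read as a question-guided loop. The sketch below is one plausible arrangement, not the authors' implementation: the control flow, the `max_hops` limit, the `Report` type, and every function and parameter name are assumptions made for illustration, and the modules are passed in as plain callables so toy stand-ins can replace real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


@dataclass
class Report:
    """Hypothetical output: the verdict plus the (question, answer) trail."""
    claim: str
    label: bool
    steps: List[QAPair] = field(default_factory=list)


def qacheck_loop(
    claim: str,
    can_verify: Callable[[str, List[QAPair]], bool],       # claim verifier
    ask: Callable[[str, List[QAPair]], str],               # question generator
    answer: Callable[[str], str],                          # question-answering module
    is_useful: Callable[[str, List[QAPair], str, str], bool],  # QA validator
    reason: Callable[[str, List[QAPair]], bool],           # reasoner
    max_hops: int = 5,                                     # assumed safety limit
) -> Report:
    context: List[QAPair] = []
    for _ in range(max_hops):
        # Stop asking once the collected context suffices to judge the claim.
        if can_verify(claim, context):
            break
        question = ask(claim, context)
        ans = answer(question)
        # Keep only (question, answer) pairs the validator accepts.
        if is_useful(claim, context, question, ans):
            context.append((question, ans))
    return Report(claim, reason(claim, context), context)


# Toy stand-ins for the five modules (invented data, for illustration only).
facts = {
    "Who directed Film X?": "Director Y",
    "Where was Director Y born?": "City Z",
}
questions = list(facts)

report = qacheck_loop(
    claim="The director of Film X was born in City Z.",
    can_verify=lambda c, ctx: len(ctx) == 2,
    ask=lambda c, ctx: questions[len(ctx)],
    answer=lambda q: facts[q],
    is_useful=lambda c, ctx, q, a: bool(a),
    reason=lambda c, ctx: any(a == "City Z" for _, a in ctx),
)
```

Returning the accumulated pairs alongside the label mirrors the abstract's emphasis on a report that exposes the reasoning steps rather than only a verdict.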

Factuality Challenges in the Era of Large Language Models

Oct 10, 2023

Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy(+8 more)

Abstract:The emergence of tools based on Large Language Models (LLMs), such as OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard, has garnered immense public attention. These incredibly useful, natural-sounding tools mark significant advances in natural language generation, yet they exhibit a propensity to generate false, erroneous, or misleading content -- commonly referred to as "hallucinations." Moreover, LLMs can be exploited for malicious applications, such as generating false but credible-sounding content and profiles at scale. This poses a significant challenge to society in terms of the potential deception of users and the increasing dissemination of inaccurate information. In light of these risks, we explore the kinds of technological innovations, regulatory reforms, and AI literacy initiatives needed from fact-checkers, news organizations, and the broader research and policy communities. By identifying the risks, the imminent threats, and some viable solutions, we seek to shed light on navigating various aspects of veracity in the era of generative AI.

* Our article offers a comprehensive examination of the challenges and risks associated with Large Language Models (LLMs), focusing on their potential impact on the veracity of information in today's digital landscape

Rethinking STS and NLI in Large Language Models

Sep 16, 2023

Yuxia Wang, Minghan Wang, Preslav Nakov

Abstract:In this study, we aim to rethink STS and NLI in the era of large language models (LLMs). We first evaluate the accuracy of clinical/biomedical STS and NLI over five datasets, and then we assess LLM predictive confidence and their capability of capturing collective human opinions. We find that LLMs may be able to provide personalised descriptions for a specific topic, or to generate semantically similar content in different tones, but that it is hard for current LLMs to make personalised judgements or decisions. We further find that zero-shot ChatGPT achieves competitive accuracy over clinical and biomedical STS/NLI compared with the fine-tuned BERT-base. However, there is a large variation in sampling, and ensembled results perform the best.

* arXiv admin note: text overlap with arXiv:2212.13138 by other authors

Fake News Detectors are Biased against Texts Generated by Large Language Models

Sep 15, 2023

Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, Preslav Nakov

Abstract:The spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. In the era of Large Language Models (LLMs), the capability to generate believable fake content has intensified these concerns. In this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and LLM-generated misinformation. Intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from distinct linguistic patterns inherent to LLM outputs. To address this, we introduce a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. The resulting model yielded marked improvements in detection accuracy for both human and LLM-generated news. To further catalyze research in this domain, we release two comprehensive datasets, GossipCop++ and PolitiFact++, thus amalgamating human-validated articles with LLM-generated fake and real news.

* The first two authors contributed equally

Gpachov at CheckThat! 2023: A Diverse Multi-Approach Ensemble for Subjectivity Detection in News Articles

Sep 13, 2023

Georgi Pachov, Dimitar Dimitrov, Ivan Koychev, Preslav Nakov

Abstract:The widespread use of social networks has given rise to subjective, misleading, and even false information on the Internet. Thus, subjectivity detection can play an important role in ensuring the objectiveness and the quality of a piece of information. This paper presents the solution built by the Gpachov team for the CLEF-2023 CheckThat! lab Task 2 on subjectivity detection. Three different research directions are explored. The first one is based on fine-tuning a sentence embeddings encoder model and dimensionality reduction. The second one explores a sample-efficient few-shot learning model. The third one evaluates fine-tuning a multilingual transformer on an altered dataset, using data from multiple languages. Finally, the three approaches are combined in a simple majority voting ensemble, resulting in 0.77 macro F1 on the test set and achieving 2nd place on the English subtask.
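
The simple majority voting step mentioned in the abstract can be sketched as follows. This is a generic illustration, not the team's code: the function name, the variable names, the `SUBJ`/`OBJ` label strings, and the example predictions are all invented here. With three voters and two labels a tie cannot occur, so no tie-breaking rule is needed.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label lists; predictions[m][i] is model m's label for example i."""
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must label the same number of examples")
    # zip(*predictions) groups the votes cast for each example across models.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Invented predictions from three hypothetical models on four sentences.
sentence_encoder = ["SUBJ", "OBJ", "OBJ", "SUBJ"]
few_shot = ["SUBJ", "SUBJ", "OBJ", "OBJ"]
multilingual = ["OBJ", "OBJ", "OBJ", "SUBJ"]

ensemble = majority_vote([sentence_encoder, few_shot, multilingual])
print(ensemble)  # ['SUBJ', 'OBJ', 'OBJ', 'SUBJ']
```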

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Sep 04, 2023

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin

Abstract:With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to be able to identify risks through the evaluation of "dangerous capabilities" in order to responsibly deploy LLMs. In this work, we collect the first open-source dataset to evaluate safeguards in LLMs, and deploy safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions. Based on our annotation, we proceed to train several BERT-like classifiers, and find that these small classifiers can achieve results that are comparable with GPT-4 on automatic safety evaluation. Warning: this paper contains example data that may be offensive, harmful, or biased.

* 18 pages, 9 figures, 11 tables

Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

Aug 30, 2023

Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal(+12 more)

Abstract:We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat

* Arabic-centric, foundation model, large-language model, LLM, generative model, instruction-tuned, Jais, Jais-chat

bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark

Jun 07, 2023

Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova, Kiril Simov, Petya Osenova, Ves Stoyanov, Ivan Koychev, Preslav Nakov, Dragomir Radev

Abstract:We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, question answering, etc.) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io/, and we hope that it will enable further advancements in developing NLU models for Bulgarian.

* Accepted to ACL 2023 (Main Conference)

Understanding Breast Cancer Survival: Using Causality and Language Models on Multi-omics Data

May 28, 2023

Mugariya Farooq, Shahad Hardan, Aigerim Zhumbhayeva, Yujia Zheng, Preslav Nakov, Kun Zhang

Abstract:The need for more usable and explainable machine learning models in healthcare increases the importance of developing and utilizing causal discovery algorithms, which aim to discover causal relations by analyzing observational data. Explainable approaches aid clinicians and biologists in predicting the prognosis of diseases and suggesting proper treatments. However, very little research has been conducted at the crossroads between causal discovery, genomics, and breast cancer, and we aim to bridge this gap. Moreover, evaluation of causal discovery methods on real data is in general notoriously difficult because ground-truth causal relations are usually unknown, and accordingly, in this paper, we also propose to address the evaluation problem with large language models. In particular, we exploit suitable causal discovery algorithms to investigate how various perturbations in the genome can affect the survival of patients diagnosed with breast cancer. We used three main causal discovery algorithms: PC, Greedy Equivalence Search (GES), and a Generalized Precision Matrix-based one. We experiment with a subset of The Cancer Genome Atlas, which contains information about mutations, copy number variations, protein levels, and gene expressions for 705 breast cancer patients. Our findings reveal important factors related to the vital status of patients using causal discovery algorithms. However, the reliability of these results remains a concern in the medical domain. Accordingly, as another contribution of the work, the results are validated through language models trained on biomedical literature, such as BlueBERT and other large language models trained on medical corpora. Our results support the use of causal discovery algorithms and language models for revealing reliable causal relations for clinical applications.

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

May 24, 2023

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji(+1 more)

Abstract:Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalistic, educational, and academic contexts. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark, M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and we show that it is challenging for detectors to generalize well on unseen examples if they are either from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4.

* 11 pages
