Ekaterina Artemova

LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

Aug 08, 2024

Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie(+14 more)

Abstract:The widespread accessibility of large language models (LLMs) to the general public has significantly amplified the dissemination of machine-generated texts (MGTs). Advancements in prompt manipulation have exacerbated the difficulty in discerning the origin of a text (human-authored vs machine-generated). This raises concerns regarding the potential misuse of MGTs, particularly within educational and academic domains. In this paper, we present LLM-DetectAIve, a system designed for fine-grained MGT detection. It is able to classify texts into four categories: human-written, machine-generated, machine-written machine-humanized, and human-written machine-polished. Contrary to previous MGT detectors that perform binary classification, introducing two additional categories in LLM-DetectAIve offers insights into the varying degrees of LLM intervention during text creation. This might be useful in some domains like education, where any LLM intervention is usually prohibited. Experiments show that LLM-DetectAIve can effectively identify the authorship of textual content, proving its usefulness in enhancing integrity in education, academia, and other domains. LLM-DetectAIve is publicly accessible at https://huggingface.co/spaces/raj-tomar001/MGT-New. The video describing our system is available at https://youtu.be/E8eT_bE7k8c.

Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers

Jul 24, 2024

Nikita Andreev, Alexander Shirnin, Vladislav Mikhailov, Ekaterina Artemova

Abstract:This paper presents Papilusion, an AI-generated scientific text detector developed within the DAGPap24 shared task on detecting automatically generated scientific papers. We propose an ensemble-based approach and conduct ablation studies to analyze the effect of the detector configurations on the performance. Papilusion is ranked 6th on the leaderboard, and we improved our performance after the competition ended, achieving an F1-score of 99.46 (+9.63) on the official test set.

* to appear in DAGPAP 2024 proceedings

RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

Jun 27, 2024

Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova

Abstract:Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.
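
The minimal-pair protocol described in the abstract can be illustrated with a short sketch: a model "passes" a pair when it assigns a higher score to the grammatical sentence than to its ungrammatical counterpart. Everything below is an assumption made for illustration, not RuBLiMP's codebase: the function names, the English toy pairs, and the toy scorer (which stands in for a language model's sentence log-probability) are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model score: penalises each
    immediately repeated word. A real setup would use a language
    model's log-probability of the sentence instead."""
    words = sentence.split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return -float(repeats)


# Toy pairs: (grammatical, ungrammatical). Purely illustrative.
toy_pairs = [
    ("the cat sleeps", "the the cat sleeps"),
    ("dogs bark loudly", "dogs bark bark loudly"),
    ("she reads books", "she books reads"),  # toy scorer cannot tell these apart
]

accuracy = minimal_pair_accuracy(toy_pairs, toy_score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

The third pair shows why the scorer matters: a tie counts as a failure, so a scorer that is blind to word order gets no credit for that phenomenon.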

AIpom at SemEval-2024 Task 8: Detecting AI-produced Outputs in M4

Mar 28, 2024

Alexander Shirnin, Nikita Andreev, Vladislav Mikhailov, Ekaterina Artemova

Abstract:This paper describes AIpom, a system designed to detect a boundary between human-written and machine-generated text (SemEval-2024 Task 8, Subtask C: Human-Machine Mixed Text Detection). We propose a two-stage pipeline combining predictions from an instruction-tuned decoder-only model and encoder-only sequence taggers. AIpom is ranked second on the leaderboard while achieving a Mean Absolute Error of 15.94. Ablation studies confirm the benefits of pipelining encoder and decoder models, particularly in terms of improved performance.
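
As a reading aid for the reported metric, here is a minimal sketch of Mean Absolute Error over predicted boundary positions. It assumes each text has one gold boundary expressed as an integer position (for example, a word index); that representation, the function name, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Average absolute distance between predicted and gold boundary positions."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


# Toy boundary positions, one per text (illustrative values only).
gold_boundaries = [10, 42, 7]
predicted_boundaries = [12, 40, 7]

mae = mean_absolute_error(predicted_boundaries, gold_boundaries)
print(f"MAE: {mae:.2f}")
```

Lower is better: a value of 0 would mean every predicted boundary coincides with the gold one.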

* 2nd place at SemEval-2024 Task 8, Subtask C, to appear in SemEval-2024 proceedings

RuBia: A Russian Language Bias Detection Dataset

Mar 26, 2024

Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, Ekaterina Artemova

Abstract:Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russian language, dubbed RuBia. The RuBia dataset is divided into 4 domains: gender, nationality, socio-economic status, and diverse; each domain is further divided into multiple fine-grained subdomains. Every example in the dataset consists of two sentences with the first reinforcing a potentially harmful stereotype or trope and the second contradicting it. These sentence pairs were first written by volunteers and then validated by native-speaking crowdsourcing workers. Overall, there are nearly 2,000 unique sentence pairs spread over 19 subdomains in RuBia. To illustrate the dataset's purpose, we conduct a diagnostic evaluation of state-of-the-art or near-state-of-the-art LLMs and discuss the LLMs' predisposition to social biases.

* accepted to LREC-COLING 2024

Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

Mar 19, 2024

Siyao Peng, Zihang Sun, Huangyan Shan, Marie Kolm, Verena Blaschke, Ekaterina Artemova, Barbara Plank

Abstract:Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

* LREC-COLING 2024

Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties

Feb 03, 2024

Ekaterina Artemova, Verena Blaschke, Barbara Plank

Abstract:Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.

* To appear in EACL 2024 (main)

LUNA: A Framework for Language Understanding and Naturalness Assessment

Jan 09, 2024

Marat Saidov, Aleksandra Bakalova, Ekaterina Taktasheva, Vladislav Mikhailov, Ekaterina Artemova

Abstract:The evaluation of Natural Language Generation (NLG) models has gained increased attention, urging the development of metrics that evaluate various aspects of generated text. LUNA addresses this challenge by introducing a unified interface for 20 NLG evaluation metrics. These metrics are categorized based on their reference-dependence and the type of text representation they employ, from string-based n-gram overlap to the utilization of static embeddings and pre-trained language models. The straightforward design of LUNA allows for easy extension with novel metrics, requiring just a few lines of code. LUNA offers a user-friendly tool for evaluating generated texts.

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

Sep 04, 2023

Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank

Abstract:Instruction-tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality issues of gold-standard labels. But so far, the application of AED methods is limited to discriminative settings. It is an open question how well AED methods generalize to generative settings which are becoming widespread via generative LLMs. In this work, we present a first and new benchmark for AED on instruction-tuning data: Donkii. It encompasses three instruction-tuning datasets enriched with annotations by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors that sometimes directly propagate into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the newly introduced dataset. Our results demonstrate that choosing the right AED method and model size is indeed crucial, thereby deriving practical recommendations. To gain insights, we provide a first case-study to examine how the quality of the instruction-tuning datasets influences downstream performance.

Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

May 09, 2023

Robert Litschko, Ekaterina Artemova, Barbara Plank

Abstract:Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust towards the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers for cross-lingual and multilingual retrieval.
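
The idea of generating code-switched training text from a bilingual lexicon can be sketched as token-level substitution: each token that has a lexicon entry is replaced by its translation with some probability. This is a simplified illustration under stated assumptions, not the authors' pipeline; the function name, the `ratio` parameter, whitespace tokenization, and the tiny English-German lexicon are all hypothetical.

```python
import random
from typing import Dict, Optional


def code_switch(
    text: str,
    lexicon: Dict[str, str],
    ratio: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace each token found in the lexicon with its translation
    with probability `ratio`; all other tokens are kept unchanged."""
    rng = rng or random.Random()
    switched = []
    for token in text.split():
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < ratio:
            switched.append(translation)
        else:
            switched.append(token)
    return " ".join(switched)


# Toy English-to-German lexicon (illustrative entries only).
toy_lexicon = {"cat": "Katze", "house": "Haus"}
query = "where does the cat sleep in the house"

fully_switched = code_switch(query, toy_lexicon, ratio=1.0)
unchanged = code_switch(query, toy_lexicon, ratio=0.0)
partly_switched = code_switch(query, toy_lexicon, ratio=0.5, rng=random.Random(0))

print(fully_switched)
print(unchanged)
print(partly_switched)
```

The `ratio` argument corresponds loosely to the "ratio of code-switched tokens" mentioned in the abstract; passing a seeded `random.Random` keeps the sampled substitutions reproducible.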

* Accepted to Findings of ACL 2023
