Anastasia Zhukova

Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language

Apr 30, 2025

Anastasia Zhukova, Christian E. Matt, Terry Ruas, Bela Gipp

Abstract: Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., language masking. Although popular, it requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining, or ICL-APT, that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that this approach outperforms traditional DAPT by 3.5 points on the average of IR metrics (e.g., mAP, MRR, and nDCG) and requires almost four times less computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.
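For readers unfamiliar with the IR metrics named in this abstract, the following is a minimal sketch of their textbook definitions, not the paper's evaluation code. The function names are hypothetical, and the average precision here assumes every relevant document appears somewhere in the ranked list.

```python
import math

def reciprocal_rank(relevance):
    """1/rank of the first relevant item in a ranked list of 0/1 labels (0.0 if none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision@rank over the ranks that hold a relevant item.

    Assumption: all relevant documents are present in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    k = len(gains) if k is None else k

    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values[:k], start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def mean_over_queries(metric, runs):
    """Average a per-query metric over several ranked lists (gives MRR, mAP, mean nDCG)."""
    return sum(metric(run) for run in runs) / len(runs)
```

Calling `mean_over_queries(reciprocal_rank, runs)` yields MRR and `mean_over_queries(average_precision, runs)` yields mAP, where `runs` holds one ranked label list per query.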

Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language

Dec 13, 2024

Anastasia Zhukova, Christian E. Matt, Bela Gipp

Abstract: Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in the low-resource, domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline, from automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained in the German chemistry domain, we explore the principle of an ensemble of "weak" text encoders trained on common knowledge datasets. We combine individual relevance scores from diverse models to retrieve document candidates and relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems for specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.

* Accepted at the First Workshop on Language Models for Low-Resource Languages (LoResLM), co-located with the 31st International Conference on Computational Linguistics (COLING 2025)
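The idea of combining relevance scores from several "weak" encoders can be illustrated with a small sketch. The abstract does not say how the scores are combined, so the min-max normalization and plain averaging below are assumptions for illustration, and all function names and document IDs are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1] so different models become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A model that scores every document the same carries no ranking signal.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def ensemble_scores(per_model_scores):
    """Average the normalized scores of all models (assumed combination rule).

    A document missing from a model's output counts as 0.0 for that model.
    """
    normalized = [min_max_normalize(s) for s in per_model_scores]
    docs = set().union(*(n.keys() for n in normalized))
    return {
        doc: sum(n.get(doc, 0.0) for n in normalized) / len(normalized)
        for doc in docs
    }

def rank(scores):
    """Document IDs sorted by descending score, ties broken by ID."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Two hypothetical encoders that score the same query on different scales.
encoder_a = {"d1": 1.0, "d2": 0.0, "d3": 0.5}
encoder_b = {"d1": 4.0, "d2": 2.0, "d3": 4.0}
combined = ensemble_scores([encoder_a, encoder_b])
ranking = rank(combined)
```

Normalizing before averaging keeps an encoder with a wide score range from dominating the consensus; a weighted average or rank-based fusion would be equally plausible alternatives.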

Domain Adaptation of Multilingual Semantic Search -- Literature Review

Feb 05, 2024

Anna Bringmann, Anastasia Zhukova

Abstract: This literature review gives an overview of current approaches to perform domain adaptation in a low-resource setting and approaches to perform multilingual semantic search in a low-resource setting. We developed a new typology to cluster domain adaptation approaches based on the part of dense textual information retrieval systems that they adapt, focusing on how to combine them efficiently. We also explore the possibilities of combining multilingual semantic search with domain adaptation approaches for dense retrievers in a low-resource setting.

Generative User-Experience Research for Developing Domain-specific Natural Language Processing Applications

Jun 28, 2023

Anastasia Zhukova, Lukas von Sperl, Christian E. Matt, Bela Gipp

Abstract: User experience (UX) is a part of human-computer interaction (HCI) research and focuses on increasing intuitiveness, transparency, simplicity, and trust for system users. Most of the UX research for machine learning (ML) or natural language processing (NLP) focuses on a data-driven methodology, i.e., it fails to focus on users' requirements, and engages domain users mainly for usability evaluation. Moreover, more typical UX methods tailor the systems towards user usability rather than learning about user needs first. The paper proposes a methodology for integrating generative UX research into developing domain NLP applications. Generative UX research employs domain users at the initial stages of prototype development, i.e., ideation and concept evaluation, and at the last stage for evaluating the change in user value. In the case study, we report the full-cycle prototype development of a domain-specific semantic search for daily operations in the process industry. Our case study shows that involving domain experts increases their interest and trust in the final NLP application. Moreover, we show that synergetic UX+NLP research efficiently considers data- and user-driven opportunities and constraints, which can be crucial for NLP applications in narrow domains.

ANEA: Automated (Named) Entity Annotation for German Domain-Specific Texts

Dec 13, 2021

Anastasia Zhukova, Felix Hamborg, Bela Gipp

Abstract: Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its common and viable use in many use cases, NER is barely applicable in domains where general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator to assist human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., annotates text datasets into the domain (named) entities.

* Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2021) co-located with JCDL 2021, Virtual Event

Newsalyze: Effective Communication of Person-Targeting Biases in News Articles

Oct 18, 2021

Felix Hamborg, Kim Heinser, Anastasia Zhukova, Karsten Donnay, Bela Gipp

Abstract: Media bias and its extreme form, fake news, can decisively affect public opinion. Especially when reporting on policy issues, slanted news coverage may strongly influence societal decisions, e.g., in democratic elections. Our paper makes three contributions to address this issue. First, we present a system for bias identification, which combines state-of-the-art methods from natural language understanding. Second, we devise bias-sensitive visualizations to communicate bias in news articles to non-expert news consumers. Third, our main contribution is a large-scale user study that measures bias-awareness in a setting that approximates daily news consumption, e.g., we present respondents with a news overview and individual articles. We not only measure the visualizations' effect on respondents' bias-awareness, but we can also pinpoint the effects on individual components of the visualizations by employing a conjoint design. Our bias-sensitive overviews strongly and significantly increase bias-awareness in respondents. Our study further suggests that our content-driven identification method detects groups of similarly slanted news articles due to substantial biases present in individual news articles. In contrast, the reviewed prior work only facilitates the visibility of biases, e.g., by distinguishing left- and right-wing outlets.

XCoref: Cross-document Coreference Resolution in the Wild

Sep 11, 2021

Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Bela Gipp

Abstract: Datasets and methods for cross-document coreference resolution (CDCR) focus on events or entities with strict coreference relations. They lack, however, the annotation and resolution of coreference mentions with more abstract or loose relations that may occur when news articles report about controversial and polarized events. Bridging and loose coreference relations trigger associations that may lead to exposing news readers to bias by word choice and labeling. For example, coreferential mentions of "direct talks between U.S. President Donald Trump and Kim" such as "an extraordinary meeting following months of heated rhetoric" or "great chance to solve a world problem" form a more positive perception of this event. A step towards bringing awareness of bias by word choice and labeling is the reliable resolution of coreferences with high lexical diversity. We propose an unsupervised method named XCoref, which is a CDCR method that capably resolves not only previously prevalent entities, such as persons, e.g., "Donald Trump," but also abstractly defined concepts, such as groups of persons, e.g., "caravan of immigrants," and events and actions, e.g., "marching to the U.S. border." In an extensive evaluation, we compare the proposed XCoref to a state-of-the-art CDCR method and a previous method, TCA, that resolves such complex coreference relations and find that XCoref outperforms these methods. Outperforming an established CDCR model shows that new CDCR models need to be evaluated on semantically complex mentions with looser coreference relations to indicate their applicability to resolving mentions in the "wild" of political news articles.

Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets

Sep 11, 2021

Anastasia Zhukova, Felix Hamborg, Bela Gipp

Abstract: Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice and more loosely related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+ and NewsWCL50 with multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains on a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that both CDCR datasets create, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
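As a point of reference for the simpler baseline the abstract mentions (the number of unique lemmas in a coreference chain), here is a rough sketch. It is not the PD metric, whose definition is not given here. Lowercased whitespace tokens stand in for lemmas, which is an assumption, and the function names and example chain are hypothetical.

```python
def unique_token_count(chain):
    """Count distinct lowercased tokens across all mentions of one coreference chain.

    Assumption: lowercased whitespace tokens approximate lemmas; a real
    implementation would run a lemmatizer first.
    """
    return len({token.lower() for mention in chain for token in mention.split()})

def mean_unique_tokens(chains):
    """Average the per-chain count over a dataset of chains."""
    return sum(unique_token_count(chain) for chain in chains) / len(chains)

# Hypothetical chain: three mentions that refer to the same group.
example_chain = ["the caravan", "The migrants", "caravan of migrants"]
count = unique_token_count(example_chain)
```

A count like this treats every new word the same way, which is the kind of coarseness a more detailed diversity measure is meant to address.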

Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons

Jul 02, 2021

Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Bela Gipp

Abstract: Unsupervised concept identification through clustering, i.e., identification of semantically related words and phrases, is a common approach to identify contextual primitives employed in various use cases, e.g., text dimension reduction (i.e., replacing words with concepts to reduce the vocabulary size), summarization, and named entity resolution. We demonstrate the first results of an unsupervised approach for the identification of groups of persons as actors extracted from a set of related articles. Specifically, the approach clusters mentions of groups of persons that act as non-named entity actors in the texts, e.g., "migrant families" = "asylum-seekers." Compared to our baseline, the approach keeps the mentions of the geopolitical entities separated, e.g., "Iran leaders" != "European leaders," and clusters (in)directly related mentions with diverse wording, e.g., "American officials" = "Trump Administration."

* Diversity, Divergence, Dialogue (2021) 514-526
