Roman Yangarber

FRaN-X: FRaming and Narratives-eXplorer

Jul 09, 2025

Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal(+14 more)

Abstract:We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity's role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.

* 19 pages, 13 figures, submitted to EMNLP 2025 - Demo Track

Entity Framing and Role Portrayal in the News

Feb 20, 2025

Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino(+2 more)

Abstract:We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
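The two-level taxonomy described in this abstract can be pictured as a simple mapping from main categories to archetypes. The sketch below is only illustrative: it lists the nine archetypes the abstract names (the full taxonomy has 22), and the names `TAXONOMY`, `FINE_TO_MAIN`, and `main_role` are hypothetical, not taken from the dataset release.

```python
# Partial taxonomy: ONLY the nine archetypes named in the abstract.
# The remaining 13 fine-grained roles are not listed there and are omitted here.
TAXONOMY = {
    "protagonist": ["guardian", "martyr", "underdog"],
    "antagonist": ["tyrant", "deceiver", "bigot"],
    "innocent": ["victim", "scapegoat", "exploited"],
}

# Inverted index from fine-grained role to its main category.
FINE_TO_MAIN = {
    fine: main for main, fines in TAXONOMY.items() for fine in fines
}


def main_role(fine_role: str) -> str:
    """Return the main category for a fine-grained role (case-insensitive)."""
    key = fine_role.strip().lower()
    if key not in FINE_TO_MAIN:
        raise KeyError(f"role not in this partial taxonomy: {fine_role!r}")
    return FINE_TO_MAIN[key]
```

A hierarchical classifier could use such a mapping to check that a predicted fine-grained role is consistent with the predicted main category.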

* 23 pages, 12 figures. Submitted to ACL Rolling Review (ARR)

Implicit assessment of language learning during practice as accurate as explicit testing

Sep 24, 2024

Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber

Abstract:Assessment of proficiency of the learner is an essential part of Intelligent Tutoring Systems (ITS). We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts: in test sessions, and in exercises during practice sessions. Exhaustive testing across a wide range of skills can provide a detailed picture of proficiency, but may be undesirable for a number of reasons. Therefore, we first aim to replace exhaustive tests with efficient but accurate adaptive tests. We use learner data collected from exhaustive tests under imperfect conditions, to train an IRT model to guide adaptive tests. Simulations and experiments with real learner data confirm that this approach is efficient and accurate. Second, we explore whether we can accurately estimate learner ability directly from the context of practice with exercises, without testing. We transform learner data collected from exercise sessions into a form that can be used for IRT modeling. This is done by linking the exercises to linguistic constructs; the constructs are then treated as "items" within IRT. We present results from large-scale studies with thousands of learners. Using teacher assessments of student ability as "ground truth," we compare the estimates obtained from tests vs. those from exercises. The experiments confirm that the IRT models can produce accurate ability estimation based on exercises.
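The abstract names Item Response Theory but not a specific model. As a minimal sketch, assuming the common two-parameter logistic (2PL) model and a simple grid-search maximum-likelihood estimate, ability could be estimated from responses to constructs treated as items. The construct names and parameter values below are made up for illustration and do not come from the paper.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct | ability theta).

    a = item discrimination, b = item difficulty (assumed model, not from the paper).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of ability.

    responses: list of (item_id, correct) pairs, correct is a bool
    items: dict mapping item_id -> (a, b)
    """
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # theta in [-4, 4]

    def log_likelihood(theta):
        total = 0.0
        for item_id, correct in responses:
            a, b = items[item_id]
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)


# Hypothetical constructs acting as IRT "items": (discrimination, difficulty).
ITEMS = {
    "genitive_plural": (1.0, -1.0),
    "verb_aspect": (1.2, 0.0),
    "participle": (0.8, 1.5),
}

mixed = [("genitive_plural", True), ("verb_aspect", True), ("participle", False)]
print(estimate_ability(mixed, ITEMS))
```

With all-correct or all-incorrect responses the likelihood has no interior maximum, so this estimate simply lands on the edge of the grid; practical systems typically add a prior on ability to avoid that.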

Probing the Category of Verbal Aspect in Transformer Language Models

Jun 04, 2024

Anisia Katinskaia, Roman Yangarber

Abstract:We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by "alternative contexts": where either the perfective or the imperfective aspect is suitable grammatically and semantically. We perform probing using BERT and RoBERTa on alternative and non-alternative contexts. First, we assess the models' performance on aspect prediction, via behavioral probing. Next, we examine the models' performance when their contextual representations are substituted with counterfactual representations, via causal probing. These counterfactuals alter the value of the "boundedness" feature--a semantic feature, which characterizes the action in the context. Experiments show that BERT and RoBERTa do encode aspect--mostly in their final layers. The counterfactual interventions affect perfective and imperfective in opposite ways, which is consistent with grammar: perfective is positively affected by adding the meaning of boundedness, and vice versa. The practical implications of our probing results are that fine-tuning only the last layers of BERT on predicting aspect is faster and more effective than fine-tuning the whole model. The model has high predictive uncertainty about aspect in alternative contexts, which tend to lack explicit hints about the boundedness of the described action.

GPT-3.5 for Grammatical Error Correction

May 14, 2024

Anisia Katinskaia, Roman Yangarber

Abstract:This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.

What do Transformers Know about Government?

Apr 22, 2024

Jue Hou, Anisia Katinskaia, Lari Kotilainen, Sathianpong Trangcasanchai, Anh-Duc Vu, Roman Yangarber

Abstract:This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank -- a dataset defining the government relations for thousands of lemmas in the languages in our experiments.

Cross-lingual Named Entity Corpus for Slavic Languages

Apr 07, 2024

Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

Abstract:This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits - single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models - XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.

* Published in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

Investigating the effect of sub-word segmentation on the performance of transformer language models

May 09, 2023

Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber

Abstract:We explore how morphemes can affect the performance of a language model. We trained GPT-2 and BERT models for both Finnish and Russian with StateMorph, a morpheme segmentation algorithm. For comparison, we also trained models with BPE and Morfessor. Our preliminary results show that StateMorph can help the model converge more efficiently and achieve a better validation score.

Linguistic Constructs as the Representation of the Domain Model in an Intelligent Language Tutoring System

Dec 03, 2022

Anisia Katinskaia, Jue Hou, Anh-Duc Vu, Roman Yangarber

Abstract:This paper presents the development of an AI-based language learning platform Revita. It is a freely available intelligent online tutor, developed to support learners of multiple languages, from low-intermediate to advanced levels. It has been in pilot use by hundreds of students at several universities, whose feedback and needs are shaping the development. One of the main emerging features of Revita is the introduction of a system of linguistic constructs as the representation of domain knowledge. The system of constructs is developed in close collaboration with experts in language teaching. Constructs define the types of exercises, the content of the feedback, and enable the detailed modeling and evaluation of learning progress.

Question Answering and Question Generation for Finnish

Nov 24, 2022

Ilmari Kylliäinen, Roman Yangarber

Abstract:Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then use normalization methods to reduce the amount of problematic data created during the translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.
