Evangelia Zve

Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings

May 11, 2026

Benjamin Icard, Lila Sainero, Alice Breton, Evangelia Zve, Jean-Gabriel Ganascia

Abstract: Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.

* To appear in the Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)
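
As a rough illustration of what "embedding dispersion" could mean in the abstract above, the sketch below computes the mean cosine distance of a set of vectors to their centroid. This is an assumption made for illustration only: the listing does not say which dispersion measure the paper uses, and the function names (`centroid`, `cosine_distance`, `dispersion`) and the toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def dispersion(vectors):
    """Assumed measure: mean cosine distance of each vector to the centroid."""
    c = centroid(vectors)
    return sum(cosine_distance(v, c) for v in vectors) / len(vectors)


# Toy 2-D "embeddings" (hypothetical data, not taken from the paper).
same_style = [[1.0, 0.0], [1.0, 0.0]]
mixed_style = [[1.0, 0.0], [0.0, 1.0]]

print(dispersion(same_style) < dispersion(mixed_style))
```

Under this assumed measure, a group of texts whose embeddings point in similar directions yields a lower value than a group whose embeddings diverge, which is the kind of contrast a style-sensitivity analysis would compare across conditions.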

From Noise to Signal: When Outliers Seed New Topics

Mar 18, 2026

Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia

Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

* To appear in the Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)
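
The three trajectory types named in the abstract above can be sketched as a small labeling rule. This is a deliberately simplified assumption, not the paper's taxonomy: it supposes that each document has one cluster label per cumulative clustering step since it entered the corpus, that `-1` marks an outlier (a convention borrowed from density-based clusterers), and that a document counts as anticipatory if it starts as an outlier and is assigned to any cluster later. The function name and label strings are hypothetical.

```python
OUTLIER = -1  # assumed marker for "not assigned to any topic"


def label_trajectory(labels_over_time):
    """Classify one document from its cluster labels at successive steps.

    labels_over_time: list of ints, one per cumulative clustering step,
    starting at the step where the document first appears.
    """
    if not labels_over_time:
        raise ValueError("a trajectory needs at least one step")
    if labels_over_time[0] != OUTLIER:
        # Assigned to a topic on arrival: treated here as reinforcing it.
        return "reinforcing"
    if any(label != OUTLIER for label in labels_over_time[1:]):
        # Outlier first, member of a topic later.
        return "anticipatory"
    # Never assigned to any topic.
    return "isolated"


print(label_trajectory([-1, -1, 3]))  # anticipatory
print(label_trajectory([2, 2, 2]))    # reinforcing
print(label_trajectory([-1, -1]))     # isolated
```

A real implementation would also need to match clusters across steps, since cluster identifiers are not guaranteed to stay stable when the corpus grows; the sketch ignores that step.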

From Outliers to Topics in Language Models: Anticipating Trends in News Corpora

Sep 26, 2025

Evangelia Zve, Benjamin Icard, Alice Breton, Lila Sainero, Gauvain Bourgne, Jean-Gabriel Ganascia

Abstract: This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.

* Presented at ICNLSP 2025; to appear in the ACL Anthology; received the Best Full Paper Award

Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models

Jan 01, 2025

Benjamin Icard, Evangelia Zve, Lila Sainero, Alice Breton, Jean-Gabriel Ganascia

Abstract: This paper analyzes how writing style affects the dispersion of embedding vectors across multiple state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.

* To appear in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi
