Yoshifumi Kawasaki

Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs

Feb 10, 2026

Yoshifumi Kawasaki

Abstract: This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we draw on a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.


Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment

May 19, 2023

Ryo Nagata, Hiroya Takamura, Naoki Otani, Yoshifumi Kawasaki

Abstract: In this paper, we propose methods for discovering semantic differences in words appearing in two corpora based on the norms of contextualized word vectors. The key idea is that the coverage of a word's meanings is reflected in the norm of its mean word vector. The proposed methods do not require the assumptions about words and corpora that previous methods do. All they require is computing, for each word type, the mean of its contextualized word vectors and the norm of that mean. Nevertheless, they are (i) robust to skew in corpus size; (ii) capable of detecting semantic differences in infrequent words; and (iii) effective in pinpointing word instances that have a meaning missing in one of the two corpora under comparison. We show these advantages for native and non-native English corpora and also for historical corpora.
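The core computation the abstract describes (a mean contextualized vector per word type, then its norm) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional vectors, and the choice to scale each vector to unit length before averaging are all assumptions made for illustration. Under that scaling, instances pointing in many directions (broad meaning coverage) yield a shorter mean vector than instances clustered in one direction.

```python
import math
from collections import defaultdict


def unit(vec):
    """Scale a non-zero vector to unit length (assumed preprocessing step)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def mean_vector_norms(instances):
    """Map each word type to the norm of its mean (unit-scaled) vector.

    `instances` is an iterable of (word, vector) pairs, one per occurrence.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in instances:
        u = unit(vec)
        if word not in sums:
            sums[word] = [0.0] * len(u)
        sums[word] = [a + b for a, b in zip(sums[word], u)]
        counts[word] += 1
    return {
        word: math.sqrt(sum((s / counts[word]) ** 2 for s in total))
        for word, total in sums.items()
    }


def norm_differences(corpus_a, corpus_b):
    """Norm in corpus A minus norm in corpus B, for words found in both."""
    norms_a = mean_vector_norms(corpus_a)
    norms_b = mean_vector_norms(corpus_b)
    return {w: norms_a[w] - norms_b[w] for w in norms_a.keys() & norms_b.keys()}


# Toy data (hypothetical): "bank" is used in two unrelated ways in corpus A
# but in only one way in corpus B.
corpus_a = [("bank", [1.0, 0.0]), ("bank", [0.0, 1.0])]
corpus_b = [("bank", [2.0, 0.0]), ("bank", [3.0, 0.0])]

diffs = norm_differences(corpus_a, corpus_b)
print(diffs)
```

In this toy setup a negative difference for a word suggests that its usages are more spread out in the first corpus than in the second. In practice the vectors would come from a contextualized language model rather than being written by hand.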
