Mihai Dascalu

"Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models

Jun 01, 2026

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

Abstract:Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

May 26, 2026

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

Abstract:An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.

* preprint

Neural Grammatical Error Correction for Romanian

Apr 26, 2026

Teodor-Mihai Cotet, Stefan Ruseti, Mihai Dascalu

Abstract:Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of the ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models and pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
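
For readers unfamiliar with the metric quoted above, F0.5 is the F-beta score with beta = 0.5, which favors precision over recall. The following is a minimal sketch of the standard formula computed from edit-level counts; the function name `f_beta` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from counts of correct edits (tp), spurious edits (fp), and missed edits (fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # With beta < 1 the denominator down-weights precision's penalty term,
    # so precision has more influence on the final score than recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only: precision 0.75, recall 0.50.
score = f_beta(tp=60, fp=20, fn=60)
print(round(score, 4))  # 0.6818
```

With the same counts, the balanced F1 score would be 0.6, so the precision-weighted F0.5 rewards a system that proposes fewer but more reliable corrections.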

Value-Aware Numerical Representations for Transformer Language Models

Jan 14, 2026

Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu

Abstract:Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.

Training Language Models with homotokens Leads to Delayed Overfitting

Jan 06, 2026

Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

Abstract:Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens, alternative valid subword segmentations of the same lexical item, as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
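
To make the idea of alternative segmentations concrete, the sketch below enumerates every way to split a word into pieces from a vocabulary and compares them with a greedy longest-prefix split. It is an illustration only, not the authors' implementation: the toy vocabulary and the function names `segmentations` and `longest_prefix` are hypothetical.

```python
from functools import lru_cache


def segmentations(word, vocab):
    """Return every split of `word` into pieces that all belong to `vocab`."""

    @lru_cache(maxsize=None)
    def split(start):
        if start == len(word):
            return ((),)  # one way to segment the empty remainder
        found = []
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in vocab:
                for rest in split(end):
                    found.append((piece,) + rest)
        return tuple(found)

    return [list(seg) for seg in split(0)]


def longest_prefix(word, vocab):
    """Greedy split that always takes the longest vocabulary piece available."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {start}")
    return pieces


# Hypothetical toy vocabulary, for illustration only.
vocab = frozenset({"undo", "un", "do", "u", "n", "d", "o"})
canonical = longest_prefix("undo", vocab)
variants = [s for s in segmentations("undo", vocab) if s != canonical]
print(canonical)      # ['undo']
print(len(variants))  # 4
```

Every variant concatenates back to the same surface string, which is what makes such segmentations usable as meaning-preserving augmentation.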

* 8 pages, 6 figures, 3 Appendices

The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

May 21, 2025

Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

Abstract:Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.

* 1 Table, 8 Figures

How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics

Oct 04, 2024

Adrian Cosma, Stefan Ruseti, Mihai Dascalu, Cornelia Caragea

Abstract:Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.

* Accepted at EMNLP 2024 Main Conference

"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions

Jun 26, 2024

Mihai Masala, Denis C. Ilie-Ablachim, Alexandru Dima, Dragos Corlatescu, Miruna Zavelca, Ovio Olaru, Simina Terian-Dan, Andrei Terian-Dan, Marius Leordeanu, Horia Velicu (+3 more)

Abstract:In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.

* arXiv admin note: text overlap with arXiv:2405.07703

OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs

May 17, 2024

Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea

UPB @ ACTI: Detecting Conspiracies using fine tuned Sentence Transformers

Sep 28, 2023

Andrei Paraschiv, Mihai Dascalu

Abstract:Conspiracy theories have become a prominent and concerning aspect of online discourse, posing challenges to information integrity and societal trust. As such, we address conspiracy theory detection as proposed by the ACTI @ EVALITA 2023 shared task. The combination of pre-trained sentence Transformer models and data augmentation techniques enabled us to secure first place in the final leaderboard of both sub-tasks. Our methodology attained F1 scores of 85.71% in the binary classification and 91.23% for the fine-grained conspiracy topic classification, surpassing other competing systems.
