While encoder-only models such as BERT and ModernBERT are ubiquitous in real-world NLP applications, their conventional reliance on task-specific classification heads can limit their applicability compared to decoder-based large language models (LLMs). In this work, we introduce ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that leverages its masked language modelling (MLM) head for generative classification. Our approach employs an intentionally simple training loop and inference mechanism that requires no heavy pre-processing, no elaborate prompt engineering, and no architectural modifications. ModernBERT-Large-Instruct exhibits strong zero-shot performance on both classification and knowledge-based tasks, outperforming similarly sized LLMs on MMLU and achieving 93% of Llama3-1B's MMLU performance with 60% fewer parameters. We also demonstrate that, when fine-tuned, the generative approach using the MLM head matches or even surpasses traditional classification-head methods across diverse NLU tasks. This capability emerges specifically in models trained on contemporary, diverse data mixes; models trained on lower-volume, less-diverse data yield considerably weaker performance. Although preliminary, these results demonstrate the potential of using the original generative masked language modelling head over traditional task-specific heads for downstream tasks. Our work suggests that further exploration of this area is warranted, highlighting many avenues for future improvement.
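To make the inference mechanism concrete, below is a minimal sketch of MLM-head generative classification with Hugging Face `transformers`: the task is posed as filling a single `[MASK]` token, and the class whose verbalizer word scores highest at that position wins. The checkpoint name `answerdotai/ModernBERT-Large-Instruct` and the cloze-style prompt template are assumptions for illustration, not necessarily the exact setup used in the paper.

```python
# Minimal sketch of MLM-head generative (zero-shot) classification.
# Assumptions: checkpoint name and prompt template are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-Large-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

text = "The movie was a complete waste of time."
labels = ["positive", "negative"]  # verbalizer: one word per class

# Pose the classification task as filling a single [MASK] token.
prompt = f"Review: {text}\nSentiment of the review: {tokenizer.mask_token}"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Locate the [MASK] position and compare the logits of the label words.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
# First subword of each label; the leading space matters for BPE tokenizers.
label_ids = [
    tokenizer(f" {w}", add_special_tokens=False)["input_ids"][0] for w in labels
]
scores = logits[0, mask_pos, label_ids]
print(labels[scores.argmax().item()])  # expected: "negative"
```

This comparison over a fixed verbalizer set is what lets the same MLM head act as a generative classifier without any task-specific head; fine-tuning in this framing simply trains the model to place the correct label word in the `[MASK]` slot.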