Sungmok Jung

Models Know Models Best: Evaluation via Model-Preferred Formats

Jan 30, 2026

Joonhak Lee, Sungmok Jung, Jongyeon Park, Jaejin Lee

Abstract: Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
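The two evaluation formats contrasted in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `logprob` callable is a hypothetical stand-in for a model's log-likelihood of a continuation given a prompt, the prompt template is an assumption, and the numbers in `toy_logprob` are made up purely so the example runs.

```python
from typing import Callable, Sequence

# Hypothetical interface: returns log p(continuation | prompt) from some LLM.
LogProb = Callable[[str, str], float]


def cloze_choice(question: str, options: Sequence[str], logprob: LogProb) -> int:
    """Cloze-style: score each option text as a continuation of the question."""
    scores = [logprob(question, " " + opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def symbol_choice(question: str, options: Sequence[str], logprob: LogProb) -> int:
    """Symbol-based: list the options with labels and score only the labels."""
    labels = "ABCDEFGH"[: len(options)]
    listing = "\n".join(f"{label}. {opt}" for label, opt in zip(labels, options))
    prompt = f"{question}\n{listing}\nAnswer:"  # assumed template
    scores = [logprob(prompt, " " + label) for label in labels]
    return max(range(len(options)), key=scores.__getitem__)


def toy_logprob(prompt: str, continuation: str) -> float:
    """Made-up scores (not real model output) so the sketch is runnable."""
    table = {" red": -2.5, " blue": -1.0, " A": -0.7, " B": -1.9}
    return table[continuation]


question = "The sky on a clear day is"
options = ["red", "blue"]
cloze_pick = cloze_choice(question, options, toy_logprob)
symbol_pick = symbol_choice(question, options, toy_logprob)
print(options[cloze_pick], options[symbol_pick])
```

With these toy scores the two formats select different options for the same item, which is the kind of format-dependent discrepancy the abstract describes. A per-instance format selector would sit on top of two such scoring functions.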

Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Jan 08, 2026

Sungmok Jung, Yeonkyoung So, Joonhak Lee, Sangho Kim, Yelim Ahn, Jaejin Lee

Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.

Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding

Jun 18, 2025

Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee

Abstract: Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models' negation understanding.

Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Jun 18, 2025

Gyeongje Cho, Yeonkyoung So, Chanwoo Park, Sangmin Lee, Sungmok Jung, Jaejin Lee

Abstract: This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
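Fertility, taken here as the average number of tokens per word as in the paper title, can be computed with a short sketch. It assumes words are whitespace-delimited units, and the two tokenizers are toy stand-ins chosen only to make the example runnable; neither is Thunder-Tok or BPE, and the resulting numbers say nothing about the paper's results.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def fertility(texts: Iterable[str], tokenize: Tokenizer) -> float:
    """Average tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Toy baseline: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    """Toy alternative: one token per whitespace-delimited word."""
    return text.split()


corpus = ["나는 학교에 간다", "오늘 날씨가 좋다"]
baseline = fertility(corpus, char_tokenize)
improved = fertility(corpus, word_tokenize)
# Relative reduction in fertility, i.e. in token count on the same corpus.
reduction = 1 - improved / baseline
print(f"baseline={baseline:.2f} improved={improved:.2f} reduction={reduction:.0%}")
```

Because the word count of a fixed corpus does not change, a relative drop in fertility equals the relative drop in total tokens, which is why the abstract treats the two as interchangeable.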
