Abstract:With the advancement of mid/post-training techniques, LLMs are pushing their boundaries at an accelerated pace. Legacy benchmarks saturate quickly (e.g., broad suites like MMLU over the years, newer ones like GPQA-D even faster), which makes frontier progress hard to track. The problem is especially acute in Korean: widely used benchmarks are fewer, often translated or narrow in scope, and updated more slowly, so saturation and contamination arrive sooner. Accordingly, at this moment, there is no Korean benchmark capable of evaluating and ranking frontier models. To bridge this gap, we introduce KAIO, a Korean, math-centric benchmark that stresses long-chain reasoning. Unlike recent Korean suites that are at or near saturation, KAIO remains far from saturated: the best-performing model, GPT-5, attains 62.8, followed by Gemini-2.5-Pro (52.3). Open models such as Qwen3-235B and DeepSeek-R1 cluster falls below 30, demonstrating substantial headroom, enabling robust tracking of frontier progress in Korean. To reduce contamination, KAIO will remain private and be served via a held-out evaluator until the best publicly known model reaches at least 80% accuracy, after which we will release the set and iterate to a harder version.
Abstract:Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, by observing loss trajectories during the pretraining phase, we observe that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.