BPE


Separate Before You Compress: The WWHO Tokenization Architecture

Mar 26, 2026
Via arXiv

SozKZ: Training Efficient Small Language Models for Kazakh from Scratch

Mar 21, 2026
Via arXiv

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Mar 17, 2026
Via arXiv

Batched Kernelized Bandits: Refinements and Extensions

Mar 13, 2026
Via arXiv

PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization

Mar 13, 2026
Via arXiv

Graph Tokenization for Bridging Graphs and Transformers

Mar 11, 2026
Via arXiv

HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling

Mar 10, 2026
Via arXiv

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Mar 03, 2026
Via arXiv

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Mar 03, 2026
Via arXiv

ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Mar 03, 2026
Via arXiv