Abstract: Surface electromyography (sEMG) signals exhibit substantial inter-subject variability and are highly susceptible to noise, posing challenges for robust and interpretable decoding. To address these limitations, we propose a discrete representation of sEMG signals based on a physiology-informed tokenization framework. The method employs a sliding window aligned with the minimal muscle contraction cycle to isolate individual muscle activation events. From each window, ten time-frequency features, including root mean square (RMS) and median frequency (MDF), are extracted, and K-means clustering is applied to group segments into representative muscle-state tokens. We also introduce a large-scale benchmark dataset, ActionEMG-43, comprising 43 diverse actions and sEMG recordings from 16 major muscle groups across the body. Based on this dataset, we conduct extensive evaluations to assess the inter-subject consistency, representation capacity, and interpretability of the proposed sEMG tokens. Our results show that the token representation exhibits high inter-subject consistency (Cohen's kappa = 0.82 ± 0.09), indicating that the learned tokens capture consistent and subject-independent muscle activation patterns. In action recognition tasks, models using sEMG tokens achieve Top-1 accuracies of 75.5% with ViT and 67.9% with SVM, outperforming raw-signal baselines (72.8% and 64.4%, respectively), despite a 96% reduction in input dimensionality. In movement quality assessment, the tokens intuitively reveal patterns of muscle underactivation and compensatory activation, offering interpretable insights into neuromuscular control. Together, these findings highlight the effectiveness of tokenized sEMG representations as a compact, generalizable, and physiologically meaningful feature space for applications in rehabilitation, human-machine interaction, and motor function analysis.
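The tokenization pipeline summarized in the abstract (sliding window, per-window features, K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it computes only two of the ten features (RMS and MDF), uses a naive DFT and a plain Lloyd's K-means with deterministic initialization, and standardizes the features before clustering, which the abstract does not mention. The function names (`window_features`, `tokenize`, and so on), the window length, the hop size, and the synthetic test signal are all hypothetical.

```python
import cmath
import math

def rms(window):
    """Root mean square amplitude of one window."""
    return math.sqrt(sum(x * x for x in window) / len(window))

def median_frequency(window, fs):
    """Frequency (Hz) at which cumulative spectral power reaches half the total."""
    n = len(window)
    half = n // 2
    # Naive DFT power for bins 1..half (DC skipped); O(n^2), acceptable for a sketch.
    power = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))) ** 2
        for k in range(1, half + 1)
    ]
    total = sum(power)
    if total == 0:
        return 0.0
    cumulative = 0.0
    for k, p in enumerate(power, start=1):
        cumulative += p
        if cumulative >= total / 2:
            return k * fs / n
    return half * fs / n

def window_features(signal, fs, win_len, hop):
    """Slide a window over the signal and return [RMS, MDF] for each window."""
    starts = range(0, len(signal) - win_len + 1, hop)
    return [[rms(signal[s:s + win_len]), median_frequency(signal[s:s + win_len], fs)]
            for s in starts]

def standardize(rows):
    """Z-score each feature column (assumption: keeps MDF from dominating distances)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; initial centroids are evenly spaced points (needs len(points) >= k)."""
    step = max(1, len(points) // k)
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids

def tokenize(signal, fs, win_len, hop, k):
    """Map a 1-D signal to a sequence of integer tokens, one per window."""
    feats = standardize(window_features(signal, fs, win_len, hop))
    labels, _ = kmeans(feats, k)
    return labels

# Synthetic stand-in for one sEMG channel: a weak 50 Hz segment, then a strong 150 Hz segment.
fs = 1000
signal = [0.1 * math.sin(2 * math.pi * 50 * t / fs) for t in range(400)]
signal += [1.0 * math.sin(2 * math.pi * 150 * t / fs) for t in range(400, 800)]
tokens = tokenize(signal, fs, win_len=100, hop=100, k=2)
print(tokens)
```

On this toy signal the two segments fall into separate clusters, so each window's token identifies which activation regime it belongs to. A real pipeline would choose the window length from the contraction cycle, use all ten features, and fit the clustering across subjects and channels.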