Abstract: We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.
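The pipeline described in this abstract (tokenize, reorder identifiers by frequency, encode as variable-length integers, then compress) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a regex word/character splitter stands in for BPE so the example has no dependencies, the varint is an LEB128-style encoding chosen for illustration, and all function names are hypothetical. The serialized vocabulary, which the reported numbers account for, is not included in the compressed size here.

```python
import re
import zlib
from collections import Counter


def encode_varint(n):
    """LEB128-style unsigned varint: 7 payload bits per byte, high bit means 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Inverse of encode_varint applied to a concatenated stream of varints."""
    ids, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            ids.append(value)
            value, shift = 0, 0
    return ids


def tokenize(text):
    # ASSUMPTION: stand-in for BPE. Splits into word runs or single non-word characters,
    # so that joining the tokens reproduces the input exactly.
    return re.findall(r"\w+|\W", text)


def frequency_ordered_encode(text):
    """Return (vocab, payload): vocab is sorted by descending frequency, payload is varint ids."""
    tokens = tokenize(text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}  # frequent tokens get small ids
    payload = b"".join(encode_varint(token_to_id[t]) for t in tokens)
    return vocab, payload


def frequency_ordered_decode(vocab, payload):
    return "".join(vocab[i] for i in decode_varints(payload))


def compress_preprocessed(text, level=9):
    """Compress the varint payload with zlib; vocabulary overhead is NOT counted here."""
    vocab, payload = frequency_ordered_encode(text)
    return vocab, zlib.compress(payload, level)
```

Because the 128 most frequent tokens receive identifiers below 128, each of them occupies a single byte in the payload, which is where the Zipfian distribution pays off. Any standard compressor can replace `zlib` in the last step.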
Abstract: We derive a scaling law relating ADC bit depth to effective bandwidth for signals with $1/f^\alpha$ power spectra. Quantization introduces a flat noise floor whose intersection with the declining signal spectrum defines an effective cutoff frequency $f_c$. We show that each additional bit extends this cutoff by a factor of $2^{2/\alpha}$, approximately doubling bandwidth per bit for $\alpha = 2$. The law requires that quantization noise be approximately white, a condition whose minimum bit depth $N_{\min}$ we show to be $\alpha$-dependent. Validation on synthetic $1/f^\alpha$ signals for $\alpha \in \{1.5, 2.0, 2.5\}$ yields prediction errors below 3\% using the theoretical noise floor $\Delta^2/(6f_s)$, and approximately 14\% when the noise floor is estimated empirically from the quantized signal's spectrum. We illustrate practical implications on real EEG data.
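The per-bit factor stated in this abstract can be checked numerically with a short sketch. It assumes a signal power spectral density of the form $A/f^\alpha$, where the amplitude constant `amplitude` is a hypothetical parameter, and a uniform $N$-bit quantizer with step $\Delta = \text{full\_scale}/2^N$. The noise floor $\Delta^2/(6f_s)$ is taken from the abstract; the function names and the remaining parameters are illustrative only.

```python
import math


def noise_floor_psd(bits, full_scale, fs):
    """Flat quantization-noise PSD, Delta^2 / (6 * fs), for a uniform quantizer.

    ASSUMPTION: the quantization step is Delta = full_scale / 2**bits.
    """
    delta = full_scale / 2 ** bits
    return delta ** 2 / (6.0 * fs)


def cutoff_frequency(bits, alpha, amplitude, full_scale, fs):
    """Frequency where the assumed signal PSD amplitude / f**alpha meets the noise floor."""
    floor = noise_floor_psd(bits, full_scale, fs)
    # Solve amplitude / f_c**alpha == floor for f_c.
    return (amplitude / floor) ** (1.0 / alpha)


def per_bit_gain(bits, alpha, amplitude=1.0, full_scale=1.0, fs=1000.0):
    """Ratio f_c(bits + 1) / f_c(bits); under these assumptions it equals 2**(2 / alpha)."""
    return (cutoff_frequency(bits + 1, alpha, amplitude, full_scale, fs)
            / cutoff_frequency(bits, alpha, amplitude, full_scale, fs))
```

Adding one bit halves $\Delta$, which lowers the noise floor by a factor of 4 and therefore moves the intersection point by $4^{1/\alpha} = 2^{2/\alpha}$. The sketch does not clamp $f_c$ to the Nyquist frequency and does not check the white-noise condition on $N_{\min}$ that the abstract mentions.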
Abstract: Alternative data representations are powerful tools that augment the performance of downstream models. However, there is an abundance of such representations within the machine learning toolbox, and the field lacks a comparative understanding of the suitability of each representation method. In this paper, we propose artifact detection and classification within EEG data as a testbed for profiling image-based data representations of time series data. We then evaluate eleven popular deep learning architectures on each of six commonly used representation methods. We find that, while the choice of representation entails a choice within the tradeoff between bias and variance, certain representations are practically more effective in highlighting features that increase the signal-to-noise ratio of the data. We present our results on EEG data, and open-source our testing framework to enable future comparative analyses in this vein.