Abstract: When training large-scale models, performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical} lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and the result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power-law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
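One heuristic way to see where the stretched-exponential form in (Ib) could come from is sketched below. It is an illustrative reading, not the paper's proof. The symbols $n$ (the compressed dataset size) and $c$ (a constant) are assumptions introduced here, and the polylogarithmic size is taken to be exactly $n = c\,(\log d)^m$.

$$
n = c\,(\log d)^m
\;\Longrightarrow\;
\log d = \left(\frac{n}{c}\right)^{1/m}
\;\Longrightarrow\;
L \sim d^{-\alpha} = \exp(-\alpha \log d) = \exp\!\left(-\alpha'\, \sqrt[m]{n}\right),
\qquad \alpha' = \alpha\, c^{-1/m}.
$$

Under these assumptions, a loss that decays as a power law in the original size $d$ decays as a stretched exponential in the compressed size $n$, which matches the form $\exp(-\alpha' \sqrt[m]{d})$ once the compressed size is renamed $d$.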
Abstract: Following their success in natural language processing, transformers for vision applications have attracted significant attention in recent years due to their excellent performance. However, existing deep learning hardware accelerators for vision cannot execute this structure efficiently because of significant differences in model architecture. This paper therefore proposes a hardware accelerator for vision transformers with row-wise scheduling, which decomposes the major operations in vision transformers into a single dot-product primitive for unified and efficient execution. Furthermore, by sharing weights across columns, we can reuse data and reduce memory usage. The implementation in TSMC 40 nm CMOS technology requires only a 262K gate count and a 149KB SRAM buffer to deliver 403.2 GOPS throughput at a 600 MHz clock frequency.
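The idea of reducing a matrix operation to one dot-product primitive, processed row by row, can be illustrated with a minimal software sketch. This is not the paper's hardware design: the function names `dot`, `matmul_rowwise`, and `ops_per_cycle` are hypothetical, and the scheduling shown is only the simplest row-at-a-time ordering. The last helper works out the operations per clock cycle implied by the reported throughput and frequency.

```python
def dot(a, b):
    """Single dot-product primitive: sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))


def matmul_rowwise(A, B):
    """Compute A @ B one output row at a time, using only dot().

    The columns of B are extracted once and reused for every row of A,
    loosely mirroring the idea of sharing weights across columns.
    """
    cols = list(zip(*B))  # columns of B, shared by all rows of A
    return [[dot(row, col) for col in cols] for row in A]


def ops_per_cycle(throughput_mops, clock_mhz):
    """Operations per clock cycle implied by throughput and frequency."""
    return throughput_mops / clock_mhz


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_rowwise(A, B))         # [[19, 22], [43, 50]]
print(ops_per_cycle(403_200, 600))  # 672.0 (403.2 GOPS at 600 MHz)
```

The figure of 672 operations per cycle follows directly from the numbers in the abstract. How many multiply-accumulate units that corresponds to depends on how an operation is counted, which the abstract does not state.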