Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for deploying these models on a wide spectrum of devices. Several isolated attempts have been made to compress Transformers prior to applying them to downstream tasks. In this work, we aim to provide context for these isolated results by studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report wall-clock time, the number of parameters, and the number of multiply-accumulate operations for these techniques, charting the landscape of compressing Transformer-based self-supervised models.
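As a concrete illustration of one of the techniques and metrics above, the following is a minimal sketch (not the implementation used in this work) of magnitude-based weight pruning applied to a single linear layer, together with simple parameter and multiply-accumulate counts; the layer dimensions and sparsity level are placeholder choices.

```python
# Sketch: magnitude-based weight pruning of one linear layer, plus simple
# parameter and MAC counts. Hyperparameters here are illustrative only.
import torch
import torch.nn as nn

def magnitude_prune(linear: nn.Linear, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights of a linear layer in place."""
    with torch.no_grad():
        flat = linear.weight.abs().flatten()
        k = int(sparsity * flat.numel())
        if k == 0:
            return
        threshold = flat.kthvalue(k).values
        mask = linear.weight.abs() > threshold
        linear.weight.mul_(mask)

# A feed-forward projection of the size often found in Transformer blocks.
layer = nn.Linear(768, 3072)
magnitude_prune(layer, sparsity=0.5)

num_params = sum(p.numel() for p in layer.parameters())
num_nonzero = int((layer.weight != 0).sum()) + layer.bias.numel()
macs_per_frame = layer.in_features * layer.out_features  # dense multiply-accumulates
print(f"parameters: {num_params}, non-zero: {num_nonzero}, MACs/frame: {macs_per_frame}")
```

Note that zeroing weights reduces the number of non-zero parameters but does not by itself reduce dense multiply-accumulate operations or wall-clock time, which is one reason all three metrics are reported separately.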
Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. HuBERT, in particular, achieves strong performance while being relatively simple to train compared to other approaches. The original experimental setting is computationally demanding, hindering the reproducibility of the models. It is also unclear why certain design decisions are made, such as the ad-hoc loss function, and whether these decisions have an impact on the learned representations. We propose MelHuBERT, a simplified version of HuBERT that takes Mel spectrograms as input, significantly reducing computation and memory consumption. We study several aspects of training, including the loss function, multi-stage training, and streaming options. Our result is an efficient yet performant model that can be trained on a single GPU.
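To make the Mel-spectrogram front end concrete, the following is a minimal sketch of the kind of input feature extraction a MelHuBERT-style model assumes; the specific hyperparameters (80 Mel bins, 25 ms window, 10 ms hop) and the file name are common choices used here for illustration and are not necessarily those of the original setup.

```python
# Sketch: log Mel-spectrogram feature extraction with torchaudio.
# All hyperparameters below are illustrative assumptions.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical input file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms frame shift
    n_mels=80,
)(waveform)
log_mel = torch.log(mel + 1e-6)  # log compression; small offset for numerical stability
print(log_mel.shape)             # (channels, n_mels, frames)
```

Starting from such precomputed spectrogram frames, rather than raw waveforms passed through a convolutional encoder, is what reduces the computation and memory footprint relative to the original HuBERT pipeline.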
We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representations for better performance, generalization, and efficiency. The challenge builds upon the SUPERB benchmark and implements metrics to measure the computation requirements of self-supervised learning (SSL) representations and to evaluate their generalizability and performance across the diverse SUPERB tasks. The SUPERB benchmark provides comprehensive coverage of popular speech processing tasks, from speech and speaker recognition to audio generation and semantic understanding. As SSL has gained interest in the speech community and shown promising outcomes, we envision the challenge to uplevel the impact of SSL techniques by motivating more practical designs of techniques beyond task performance. We summarize the results of the 14 submitted models in this paper. We also discuss the main findings from those submissions and the future directions of SSL research.