Abstract:Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
Abstract:The global spread of COVID-19 had severe consequences for public health and the world economy. The quick onset of the pandemic highlighted the potential benefits of cheap and deployable pre-screening methods to monitor the prevalence of the disease in a population. Various researchers made use of machine learning methods in an attempt to detect COVID-19. The solutions leverage various input features, such as CT scans or cough audio signals, with state-of-the-art results arising from deep neural network architectures. However, larger models require more compute; a pertinent consideration when deploying to the edge. To address this, we first recreated two models that use cough audio recordings to detect COVID-19. Through applying network pruning and quantisation, we were able to compress these two architectures without reducing the model's predictive performance. Specifically, we were able to achieve an 105.76x and an 19.34x reduction in the compressed model file size with corresponding 1.37x and 1.71x reductions in the inference times of the two models.