Abstract:Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-based denoising. A cyclic minimum power distortionless response (cMPDR) spectral beamformer is used as a preprocessing block. It exploits the spectral correlations of cyclostationary noise to suppress harmonic components prior to learning-based enhancement and does not require modifications to the DNN architecture. The proposed pipeline is evaluated in a single-channel setting using two DNN architectures: a simple and lightweight convolutional recurrent neural network (CRNN), and a state-of-the-art model, namely ultra-low complexity network (ULCNet). Experiments on synthetic data and real-world recordings dominated by rotating machinery noise demonstrate consistent improvements over end-to-end DNN baselines, particularly at low SNRs. Remarkably, a parameter-efficient CRNN with cMPDR preprocessing surpasses the performance of the larger ULCNet operating on raw or Wiener-filtered inputs. These results indicate that explicitly incorporating cyclostationarity as a signal prior is more effective than increasing model capacity alone for suppressing harmonic interference.
Abstract:Single-channel speech enhancement algorithms are often used in resource-constrained embedded devices, where low latency and low complexity designs gain more importance. In recent years, researchers have proposed a wide variety of novel solutions to this problem. In particular, a recent deep learning model named ULCNet is among the state-of-the-art approaches in this domain. This paper proposes an adaptation of ULCNet, by replacing its GRU layers with FastGRNNs, to reduce both computational latency and complexity. Furthermore, this paper shows empirical evidence on the performance decay of FastGRNNs in long audio signals during inference due to internal state drifting, and proposes a novel approach based on a trainable complementary filter to mitigate it. The resulting model, Fast-ULCNet, performs on par with the state-of-the-art original ULCNet architecture on a speech enhancement task, while reducing its model size by more than half and decreasing its latency by 34% on average.