Abstract:In distant microphones, broadband (BB) methods for direction-of-arrival (DoA) estimation are more suitable than narrowband (NB) methods. Due to the aggregation of their optimization function across all frequency bands, BB estimators are robust to spatial aliasing, a known problem in processing distant microphone data. In NB methods, DoA estimation is performed by utilizing \textit{local} information in each frequency band and hence the estimation is affected by spatial aliasing. However, unlike BB methods, NB methods exploit frequency sparsity to estimate the DoAs of \textit{multiple speakers} in a \textit{single time frame}. In this article, a method to improve the robustness of a NB DoA estimator to spatial aliasing is developed. The proposed method is based on cross-correlation of speech-present time-frequency regions obtained by single frequency filtering (SFF) of the microphone signals. The SFF spectrum is chosen because SFF components have regions of high signal-to-noise ratio both in time and frequency and because speech and non-speech discrimination is robust to degradations in the SFF domain. The proposed NB estimator is compared to four state-of-the-art estimators (one NB and three BB) using detection and accuracy metrics on simulated and real-world data in different reverberation and noise conditions. The results show that in all the environments, the SFF-based NB approach outperforms the state-of-the-art NB approach. Furthermore, the performance of the SFF-based approach is better than some of the BB estimators.
Abstract:Robust direction-of-arrival (DoA) estimation from noisy and reverberant microphone signals remains challenging. Conventional estimators such as generalized cross-correlation (GCC) and its variants operate in the short-time Fourier transform (STFT) domain, where spectral features primarily reflect vocal-tract characteristics. Recent single frequency filtering (SFF)-based estimators instead use a time-frequency representation that provides high spectral resolution of harmonics along with high temporal resolution of excitation-source events, such as epoch-like impulses. Since excitation-source features have been shown to be more robust to noise and reverberation than spectral features, this work proposes an improved SFF-based DoA estimator that correlates the envelopes of SFF outputs across microphone channels using PHAT-weighted GCC. We further provide a comprehensive evaluation of SFF-based and state-of-the-art GCC-based estimators using publicly available real-room recordings under challenging reverberant, multi-speaker, and noise-corrupted conditions. Experimental results show that the proposed method and an existing SFF-based estimator achieve detection and accuracy performance that is superior or comparable to the best GCC-based estimator across all test cases. We also demonstrate that using speech-dominant bins improves GCC-PHAT robustness, motivating future incorporation of such weighting strategies into SFF-based DoA estimation.