Abstract:To evaluate the performance of audio signal processing algorithms and to train data-driven algorithms, e.g., as applied in hearing instruments, either simulated or recorded data can be used. While large batches of simulated data can be generated using mathematical models, recorded data provide a more adequate representation of real-life scenarios. Therefore, in this paper, the Hearing Instrument Dataset in Various Acoustical Scenarios (HIDVAS) is introduced. This dataset consists of both impulse responses and audio recordings using eight external loudspeakers, two external microphones, and a dummy head. On this dummy head, behind-the-ear (BTE) hearing instrument shells with two microphones per shell are mounted, and in the dummy head's ears, receiver-in-canal (RIC) hearing instrument loudspeakers are inserted. The dummy head also contains microphones located at its eardrums. The impulse responses have been computed from a swept-sine recording for each microphone-loudspeaker pair, and the audio recordings have been obtained by playing back audio (male and female speech, speech-shaped noise, singing voice, stringed instrument, wind instrument, and percussion instrument) through each individual loudspeaker and recording simultaneously using all microphones. These recordings have been repeated for four hearing instrument domes (open, semi-open, closed, and no-RIC) in three reverberation conditions in one room (T30 = 0.09 s, T30 = 0.47 s, and T30 = 0.73 s), and in one reverberation condition in a different room (T30 = 1.48 s). The usage of the dataset as a 'hearing instrument in a box' is exemplified with three example use cases.
Abstract:In public address systems and hearing aids, the maximally achievable amplification or gain is limited by acoustic feedback. Therefore, in order to be able to apply a higher gain, feedback cancellation methods are required. In addition, it is often also desirable to dereverberate a recorded signal, that is, remove the late reverberation component of the signal, before playing it back. In this paper, it is shown that under two mild conditions, the acoustic feedback signal can be written as a reverberant version of the source signal. Therefore, it is possible to treat the joint dereverberation and acoustic feedback cancellation problem as a dereverberation-only problem, meaning that dereverberation algorithms can be applied to the joint problem. Simulations corroborate this finding.
Abstract:In a wireless acoustic sensor network (WASN), devices (i.e., nodes) can collaborate through distributed algorithms to collectively perform audio signal processing tasks. This paper focuses on the distributed estimation of node-specific desired speech signals using network-wide Wiener filtering. The objective is to match the performance of a centralized system that would have access to all microphone signals, while reducing the communication bandwidth usage of the algorithm. Existing solutions, such as the distributed adaptive node-specific signal estimation (DANSE) algorithm, converge towards the multichannel Wiener filter (MWF) which solves a centralized linear minimum mean square error (LMMSE) signal estimation problem. However, they do so iteratively, which can be slow and impractical. Many solutions also assume that all nodes observe the same set of sources of interest, which is often not the case in practice. To overcome these limitations, we propose the distributed multichannel Wiener filter (dMWF) for fully connected WASNs. The dMWF is non-iterative and optimal even when nodes observe different sets of sources. In this algorithm, nodes exchange neighbor-pair-specific, low-dimensional (fused) signals estimating the contribution of sources observed by both nodes in the pair. We formally prove the optimality of dMWF and demonstrate its performance in simulated speech enhancement experiments. The proposed algorithm is shown to outperform DANSE in terms of objective metrics after short operation times, highlighting the benefit of its iterationless design.
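For orientation, the centralized LMMSE problem that the abstract refers to can be written in its textbook form as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper: $\mathbf{y}$ stacks all microphone signals in the network, $d_k$ is the desired signal at node $k$, and $\mathbf{w}_k$ is the corresponding filter.

```latex
\begin{align}
  \hat{\mathbf{w}}_k
    &= \arg\min_{\mathbf{w}} \;
       \mathbb{E}\left\{ \left| d_k - \mathbf{w}^{H} \mathbf{y} \right|^{2} \right\} \\
    &= \mathbf{R}_{yy}^{-1} \, \mathbf{r}_{y d_k},
  \qquad
  \mathbf{R}_{yy} = \mathbb{E}\left\{ \mathbf{y} \mathbf{y}^{H} \right\},
  \quad
  \mathbf{r}_{y d_k} = \mathbb{E}\left\{ \mathbf{y} \, d_k^{*} \right\}, \\
  \hat{d}_k &= \hat{\mathbf{w}}_k^{H} \mathbf{y}.
\end{align}
```

This is the generic Wiener solution only. How the dMWF reaches it with fused, low-dimensional signals exchanged between node pairs is specific to the paper and is not reproduced here.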




Abstract:The localization of acoustic reflectors is a fundamental component in various applications, including room acoustics analysis, sound source localization, and acoustic scene analysis. Time Delay Estimation (TDE) is essential for determining the position of reflectors relative to a sensor array. Traditional TDE algorithms generally yield time delays that are integer multiples of the operating sampling period, potentially lacking sufficient time resolution. To achieve subsample TDE accuracy, various interpolation methods, including parabolic, Gaussian, frequency, and sinc interpolation, have been proposed. This paper presents a comprehensive study on time delay interpolation to achieve subsample accuracy for acoustic reflector localization in reverberant conditions. We derive the Whittaker-Shannon interpolation formula from the previously proposed sinc interpolation in the context of short-time windowed TDE for acoustic reflector localization. Simulations show that sinc and Whittaker-Shannon interpolation outperform existing methods in terms of time delay error and positional error for critically sampled and band-limited reflections. Performance is evaluated on real-world measurements from the MYRiAD dataset, showing that sinc and Whittaker-Shannon interpolation consistently provide reliable performance across different sensor-source pairs and loudspeaker positions. These results can enhance the precision of acoustic reflector localization systems, vital for applications such as room acoustics analysis, sound source localization, and acoustic scene analysis.
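Of the interpolation methods listed above, parabolic interpolation is the simplest to illustrate. The following is a minimal sketch, not the paper's implementation: it fits a parabola through the peak of a correlation sequence and its two neighbours and returns the vertex position in fractional samples. The function names `parabolic_peak_offset` and `subsample_delay` are hypothetical.

```python
def parabolic_peak_offset(y_prev, y_peak, y_next):
    """Offset (in samples) of the vertex of the parabola through
    three equally spaced points, relative to the middle point."""
    denom = y_prev - 2.0 * y_peak + y_next
    if denom == 0.0:
        # The three points are collinear: no curvature, no refinement.
        return 0.0
    return 0.5 * (y_prev - y_next) / denom


def subsample_delay(corr):
    """Integer peak index of `corr`, refined by parabolic interpolation."""
    k = max(range(len(corr)), key=lambda i: corr[i])
    if k == 0 or k == len(corr) - 1:
        # No neighbour on one side: fall back to the integer estimate.
        return float(k)
    return k + parabolic_peak_offset(corr[k - 1], corr[k], corr[k + 1])


# Synthetic example: samples of an exact parabola peaking at 3.3 samples.
corr = [-(i - 3.3) ** 2 for i in range(8)]
print(subsample_delay(corr))  # approximately 3.3
```

The estimate is exact here only because the test sequence is itself a parabola. For a band-limited (sinc-shaped) correlation peak, the parabolic fit is generally biased, which is the kind of limitation that motivates the sinc-based interpolators studied in the paper.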




Abstract:Interactive acoustic auralization allows users to explore virtual acoustic environments in real time, enabling the acoustic recreation of concert halls or Historical Worship Spaces (HWS) that are either no longer accessible, acoustically altered, or impractical to visit. Interactive acoustic synthesis requires real-time convolution of input signals with a set of synthesis filters that model the space-time acoustic response of the space. The acoustics in concert halls and HWS are both characterized by a long reverberation time, resulting in synthesis filters containing many filter taps. As a result, the convolution process can be computationally demanding, introducing significant latency that limits the real-time interactivity of the auralization system. In this paper, the implementation of a real-time multichannel loudspeaker-based auralization system is presented. This system is capable of synthesizing the acoustics of highly reverberant spaces in real time using GPU acceleration. A comparison between traditional CPU-based convolution and GPU-accelerated convolution is presented, showing that the latter can achieve real-time performance with significantly lower latency. Additionally, the system integrates acoustic synthesis with acoustic feedback cancellation on the GPU, creating a unified loudspeaker-based auralization framework that minimizes processing latency.
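The abstract does not state which convolution scheme the system uses, so the following is only a generic illustration of block-based (streaming) convolution, assuming a plain time-domain overlap-add structure. Each input block is convolved with the filter and the part of the result that extends past the block is carried over to the next one. The function names are hypothetical, and a practical system would typically replace the inner convolution with an FFT-based or GPU-accelerated one.

```python
def convolve_direct(x, h):
    """Full linear convolution of two sequences (time domain)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def convolve_overlap_add(x, h, block_size):
    """Convolve `x` with `h` block by block, carrying the filter tail."""
    if block_size < 1 or not h:
        raise ValueError("block_size must be >= 1 and h must be non-empty")
    tail = [0.0] * (len(h) - 1)  # contribution of past blocks still to be output
    out = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        seg = convolve_direct(block, h)  # length len(block) + len(h) - 1
        for i, t in enumerate(tail):
            seg[i] += t                  # add what previous blocks left over
        out.extend(seg[:len(block)])     # samples that are final for this block
        tail = seg[len(block):]          # remainder spills into later blocks
    out.extend(tail)                     # flush the remaining reverberant tail
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [1.0, 0.5, 0.25]
print(convolve_overlap_add(x, h, block_size=3) == convolve_direct(x, h))
```

The cost per block grows with the filter length, which is why filters with many taps, as needed for highly reverberant spaces, make real-time operation demanding.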
Abstract:Measuring room impulse responses (RIRs) at multiple spatial points is a time-consuming task, while simulations require detailed knowledge of the room's acoustic environment. In prior work, we proposed a method for estimating the early part of RIRs along a linear trajectory in a time-varying acoustic scenario involving a static sound source and a microphone moving at constant velocity. This approach relies on measured RIRs at the start and end points of the trajectory and assumes that the time intervals occupied by the direct sound and individual reflections along the trajectory are non-overlapping. The method's applicability is therefore restricted to relatively small areas within a room, and its performance has yet to be validated with real-world data. In this paper, we propose a practical extension of the method to more realistic scenarios by segmenting longer trajectories into smaller linear intervals where the assumptions approximately hold. Applying the method piecewise along these segments extends its applicability to more complex room environments. We demonstrate its effectiveness using the trajectoRIR database, which includes moving microphone recordings and RIR measurements at discrete points along a controlled L-shaped trajectory in a real room.




Abstract:Sound field reconstruction refers to the problem of estimating the acoustic pressure field over an arbitrary region of space, using only a limited set of measurements. Physics-informed neural networks have been adopted to solve the problem by incorporating in the training loss function the governing partial differential equation, either the Helmholtz or the wave equation. In this work, we introduce a boundary integral network for sound field reconstruction. Relying on the Kirchhoff-Helmholtz boundary integral equation to model the sound field in a given region of space, we employ a shallow neural network to retrieve the pressure distribution on the boundary of the considered domain, which makes it possible to accurately retrieve the acoustic pressure inside it. Assuming the positions of measurement microphones are known, we train the model by minimizing the mean squared error between the estimated and measured pressure at those locations. Experimental results indicate that the proposed model outperforms existing physics-informed data-driven techniques.

Abstract:Our everyday auditory experience is shaped by the acoustics of the indoor environments in which we live. Room acoustics modeling is aimed at establishing mathematical representations of acoustic wave propagation in such environments. These representations are relevant to a variety of problems ranging from echo-aided auditory indoor navigation to restoring speech understanding in cocktail party scenarios. Many disciplines in science and engineering have recently witnessed a paradigm shift powered by deep learning (DL), and room acoustics research is no exception. The majority of deep, data-driven room acoustics models are inspired by DL-based speech and image processing, and hence lack the intrinsic space-time structure of acoustic wave propagation. More recently, DL-based models for room acoustics that include either geometric or wave-based information have delivered promising results, primarily for the problem of sound field reconstruction. In this review paper, we will provide an extensive and structured literature review on deep, data-driven modeling in room acoustics. Moreover, we position these models in a framework that allows for a conceptual comparison with traditional physical and data-driven models. Finally, we identify strengths and shortcomings of deep, data-driven room acoustics models and outline the main challenges for further research.




Abstract:Data availability is essential to develop acoustic signal processing algorithms, especially when it comes to data-driven approaches that demand large and diverse training datasets. For this reason, an increasing number of databases have been published in recent years, including either room impulse responses (RIRs) or recordings of moving audio. In this paper, we introduce the trajectoRIR database, an extensive, multi-array collection of both dynamic and stationary acoustic recordings along a controlled trajectory in a room. Specifically, the database features recordings using moving microphones and stationary RIRs spatially sampling the room acoustics along an L-shaped, 3.74-meter-long trajectory. This combination makes trajectoRIR unique and applicable in various tasks ranging from sound source localization and tracking to spatially dynamic sound field reconstruction and system identification. The recording room has a reverberation time of 0.5 seconds, and the three different microphone configurations employed include a dummy head, with additional reference microphones located next to the ears, three first-order Ambisonics microphones, two circular arrays of 16 and 4 channels, and a 12-channel linear array. The motion of the microphones was achieved using a robotic cart traversing a rail at three speeds: 0.2, 0.4, and 0.8 m/s. Audio signals were reproduced using two stationary loudspeakers. The collected database features 8648 stationary RIRs, as well as perfect sweeps, speech, music, and stationary noise recorded during motion. MATLAB and Python scripts are included to access the recorded audio as well as to retrieve geometrical information.

Abstract:Two algorithms for combined acoustic echo cancellation (AEC) and noise reduction (NR) are analysed, namely the generalised echo and interference canceller (GEIC) and the extended multichannel Wiener filter (MWFext). Previously, these algorithms have been examined for linear echo paths, and assuming access to voice activity detectors (VADs) that separately detect desired speech and echo activity. However, algorithms implementing VADs may introduce detection errors. Therefore, in this paper, the previous analyses are extended by 1) modelling general nonlinear echo paths by means of the generalised Bussgang decomposition, and 2) modelling VAD error effects in each specific algorithm, thereby also allowing specific VAD assumptions to be modelled. It is found and verified with simulations that, generally, the MWFext achieves a higher NR performance, while the GEIC achieves a more robust AEC performance.