Speech enhancement is a fundamental challenge in signal processing, particularly when robustness is required across diverse acoustic conditions and microphone setups. Deep learning methods have been successful for speech enhancement, but often assume fixed array geometries, limiting their use in mobile, embedded, and wearable devices. Existing array-agnostic approaches typically rely on either raw microphone signals or beamformer outputs, but both have drawbacks under changing geometries. We introduce HyBeam, a hybrid framework that uses raw microphone signals at low frequencies and beamformer signals at higher frequencies, exploiting their complementary strengths while remaining highly array-agnostic. Simulations across diverse rooms and wearable array configurations demonstrate that HyBeam consistently surpasses microphone-only and beamformer-only baselines in PESQ, STOI, and SI-SDR. A bandwise analysis shows that the hybrid approach leverages beamformer directivity at high frequencies and microphone cues at low frequencies, outperforming either method alone across all bands.
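
As a rough illustration of the band-split idea (not the authors' implementation), the following Python sketch forms a hybrid time-frequency representation that takes raw-microphone bins below a cutoff frequency and beamformer bins above it; the cutoff, STFT settings, and function names are assumptions.

```python
# Hypothetical sketch of HyBeam's band-wise input fusion: raw-microphone STFT bins
# below a cutoff frequency, beamformer STFT bins above it. Settings are illustrative.
import numpy as np
from scipy.signal import stft

def hybrid_features(mic_signals, beamformer_out, fs=16000, cutoff_hz=1500, nfft=512):
    """mic_signals: (num_mics, num_samples); beamformer_out: (num_samples,)."""
    f, _, mic_spec = stft(mic_signals[0], fs=fs, nperseg=nfft)   # low-frequency branch
    _, _, bf_spec = stft(beamformer_out, fs=fs, nperseg=nfft)    # high-frequency branch
    # Keep raw-microphone bins below the cutoff and beamformer bins above it
    return np.where(f[:, None] < cutoff_hz, mic_spec, bf_spec)   # (freq_bins, frames)
```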

Speech separation and enhancement (SSE) has advanced remarkably and achieved promising results in controlled settings, such as a fixed number of speakers and a fixed array configuration. Towards a universal SSE system, single-channel systems have been extended to deal with a variable number of speakers (i.e., outputs), while multi-channel systems accommodating various array configurations (i.e., inputs) have been developed. However, these attempts have been pursued separately. In this paper, we propose a flexible-input, flexible-output SSE system named FlexIO. It performs conditional separation using prompt vectors, one per speaker, allowing an arbitrary number of speakers to be separated. Multi-channel mixtures are processed together with the prompt vectors via an array-agnostic channel communication mechanism. Our experiments demonstrate that FlexIO successfully covers diverse conditions with one to five microphones and one to three speakers. We also confirm the robustness of FlexIO on CHiME-4 real data.
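
A minimal PyTorch sketch of the two ingredients named above, under our own assumptions about layer sizes and conditioning (the paper's actual architecture may differ): an array-agnostic channel-communication step that averages hidden states across however many microphones are present, and a FiLM-style modulation by one prompt vector per speaker.

```python
import torch
import torch.nn as nn

class ChannelCommunication(nn.Module):
    """Array-agnostic mixing: concatenate each channel with the cross-channel mean."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (batch, channels, frames, dim)
        mean = x.mean(dim=1, keepdim=True).expand_as(x)
        return self.proj(torch.cat([x, mean], dim=-1))

class PromptConditioner(nn.Module):
    """Modulate features with one prompt vector per speaker (FiLM-style assumption)."""
    def __init__(self, dim, prompt_dim):
        super().__init__()
        self.scale = nn.Linear(prompt_dim, dim)
        self.shift = nn.Linear(prompt_dim, dim)

    def forward(self, x, prompt):              # x: (B, C, T, dim); prompt: (B, prompt_dim)
        s = self.scale(prompt)[:, None, None, :]
        b = self.shift(prompt)[:, None, None, :]
        return x * s + b
```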

The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays, and expanded to 2900 samples through domain-informed augmentation. This FAIR-compliant resource addresses the major Big Data challenges: volume (90 hours of recordings, 65.6 GB), variety (multi-farm and multi-zone acoustics), velocity (real-time processing), and veracity (noise-robust feature extraction). Our distributed processing framework integrates advanced denoising using iZotope RX, multimodal synchronization through audio and video alignment, and standardized feature engineering with 24 acoustic descriptors generated from Praat, librosa, and openSMILE. Preliminary benchmarks reveal distinct class-level acoustic patterns for estrus detection, distress classification, and maternal communication. The dataset's ecological realism, reflecting authentic barn acoustics rather than controlled settings, ensures readiness for field deployment. This work establishes a foundation for animal-centered AI, where bioacoustic data enable continuous, non-invasive welfare assessment at industrial scale. By releasing standardized pipelines and detailed metadata, we promote reproducible research that connects Big Data analytics, sustainable agriculture, and precision livestock management. The framework supports UN SDG 9, showing how data science can turn traditional farming into intelligent, welfare-optimized systems that meet global food needs while upholding ethical animal care.
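
As an illustration of the standardized feature-engineering step, the sketch below extracts a handful of librosa descriptors from one clip; the paper's exact 24-descriptor set, and its Praat and openSMILE components, are not reproduced here, and the choices shown are assumptions.

```python
# Illustrative librosa-based descriptor extraction for a single audio clip.
import numpy as np
import librosa

def basic_descriptors(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    return {
        "mfcc_mean": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y).mean(),
        "rms_energy": librosa.feature.rms(y=y).mean(),
    }
```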

The steered response power (SRP) method is one of the most popular approaches for acoustic source localization with microphone arrays. It is often based on simplifying acoustic assumptions, such as an omnidirectional sound source in the far field of the microphone array(s), free-field propagation, and spatially uncorrelated noise. In reality, however, there are many acoustic scenarios in which such assumptions are violated. This paper proposes a generalization of the conventional SRP method that allows generic acoustic models to be applied for localization with arbitrary microphone constellations. These models may consider, for instance, level differences in distributed microphones, the directivity of sources and receivers, or acoustic shadowing effects. Moreover, measured acoustic transfer functions may also be used as the acoustic model. We show that the delay-and-sum beamforming of the conventional SRP is not optimal for localization with generic acoustic models. To this end, we propose a generalized SRP beamforming criterion that considers generic acoustic models and spatially correlated noise, and derive an optimal SRP beamformer. Furthermore, we propose and analyze appropriate frequency weightings. Unlike the conventional SRP, the proposed method can jointly exploit observed level and time differences between the microphone signals to infer the source location. Realistic simulations of three different microphone setups with speech under various noise conditions indicate that the proposed method can significantly reduce the mean localization error compared to the conventional SRP; in particular, a reduction of more than 60% can be achieved in noisy conditions.
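
For reference, a minimal numpy sketch of the conventional SRP-PHAT that this work generalizes: delay-and-sum style steering over a grid of candidate positions using phase-transform-weighted cross-spectra. Geometry, grid, and STFT handling are illustrative assumptions, and the proposed generalized beamformer itself is not shown.

```python
import numpy as np

def srp_phat(stfts, mic_pos, candidates, freqs, c=343.0):
    """stfts: (M, F, T) complex STFTs; mic_pos: (M, 3) positions in metres;
    candidates: (Q, 3) candidate source positions; freqs: (F,) bin frequencies in Hz."""
    M = stfts.shape[0]
    # Distance from every candidate point to every microphone: (Q, M)
    dists = np.linalg.norm(candidates[:, None, :] - mic_pos[None, :, :], axis=-1)
    power = np.zeros(len(candidates))
    for i in range(M):
        for j in range(i + 1, M):
            csd = stfts[i] * np.conj(stfts[j])                      # (F, T) cross-spectrum
            cross = np.mean(csd / (np.abs(csd) + 1e-12), axis=-1)   # PHAT weighting, (F,)
            tdoa = (dists[:, i] - dists[:, j]) / c                  # (Q,) candidate TDOAs
            steering = np.exp(2j * np.pi * freqs[None, :] * tdoa[:, None])  # (Q, F)
            power += np.real(steering @ cross)
    return power   # the candidate with the largest value is the SRP location estimate
```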

Multichannel speech enhancement leverages spatial cues to improve intelligibility and quality, but most learning-based methods rely on a specific microphone array geometry and cannot account for geometry changes. To mitigate this limitation, current array-agnostic approaches employ large multi-geometry datasets but may still fail to generalize to unseen layouts. We propose AmbiDrop (Ambisonics with Dropouts), an Ambisonics-based framework that encodes arbitrary array recordings into the spherical harmonics domain using Ambisonics Signal Matching (ASM). A deep neural network is trained on simulated Ambisonics data, combined with channel dropout for robustness against array-dependent encoding errors, thereby removing the need for a diverse microphone array database. Experiments show that while the baseline and proposed models perform similarly on the training arrays, the baseline degrades on unseen arrays. In contrast, AmbiDrop consistently improves SI-SDR, PESQ, and STOI, demonstrating strong generalization and practical potential for array-agnostic speech enhancement.
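
The sketch below illustrates the two ingredients named above in generic form: a regularized least-squares encoding from array signals to spherical-harmonic (Ambisonics) coefficients, and random channel dropout applied during training. It is not the paper's ASM formulation; the steering matrix, regularization, and dropout rate are assumptions.

```python
import numpy as np

def encode_to_ambisonics(mic_stft, steering, reg=1e-3):
    """mic_stft: (M, T) microphone signals at one frequency bin; steering: (M, N_sh)
    modelled array response from each spherical-harmonic component to each microphone."""
    A = steering
    enc = np.linalg.solve(A.conj().T @ A + reg * np.eye(A.shape[1]), A.conj().T)
    return enc @ mic_stft                    # (N_sh, T) estimated Ambisonics signals

def channel_dropout(ambi, p=0.2, rng=None):
    """Randomly zero whole Ambisonics channels during training so the network
    becomes robust to array-dependent encoding errors."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(ambi.shape[0]) > p
    return ambi * mask[:, None]
```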

Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task that labels sound events in a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. We then extend state-of-the-art audio tagging systems, such as pre-trained audio neural networks (PANNs) and the audio spectrogram transformer (AST), to the proposed region-specific audio tagging task. Experimental results on both simulated and real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating directional features is beneficial for omnidirectional tagging.
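
A small sketch of the kind of spectral and spatial input features such a system might use: log-mel energies from a reference channel plus inter-channel phase differences (IPDs). The paper's exact feature set and region encoding are not specified here; the settings below are assumptions.

```python
import numpy as np
import librosa

def spectral_spatial_features(multichannel_audio, sr=16000, n_fft=512, hop=256, n_mels=64):
    """multichannel_audio: (num_channels, num_samples)."""
    ref = multichannel_audio[0]
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=ref, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels))
    stfts = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop)
                      for ch in multichannel_audio])
    # Inter-channel phase differences relative to the reference channel
    ipd = np.angle(stfts[1:] * np.conj(stfts[0][None]))
    return logmel, ipd
```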

Spherical microphone arrays (SMAs) are widely used for sound field analysis, and sparse recovery (SR) techniques can significantly enhance their spatial resolution by modeling the sound field as a sparse superposition of dominant plane waves. However, the spatial resolution of SMAs is fundamentally limited by their spherical harmonic order, and their performance often degrades in reverberant environments. This paper proposes a two-stage SR framework with residue refinement that integrates observations from a central SMA and four surrounding linear microphone arrays (LMAs). The core idea is to exploit complementary spatial characteristics by treating the SMA as a primary estimator and the LMAs as a spatially complementary refiner. Simulation results demonstrate that the proposed SMA-LMA method significantly enhances spatial energy map reconstruction under varying reverberation conditions, compared to both SMA-only and direct one-step joint processing. These results confirm the effectiveness of the proposed framework in enhancing spatial fidelity and robustness in complex acoustic environments.
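
A conceptual numpy sketch of two-stage sparse recovery with residue refinement, under assumed plane-wave dictionaries for the SMA and LMA observations: a sparse fit against the SMA dictionary, followed by a second fit to the residual seen by the LMAs. The paper's actual SMA-LMA formulation is more involved and is not reproduced here.

```python
import numpy as np

def omp(A, y, k):
    """Simple orthogonal matching pursuit: y ≈ A @ x with at most k nonzero entries."""
    residual, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
        coefs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coefs
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = coefs
    return x

def two_stage_estimate(A_sma, y_sma, A_lma, y_lma, k=5):
    x1 = omp(A_sma, y_sma, k)             # stage 1: SMA as primary estimator
    residual_lma = y_lma - A_lma @ x1     # part left unexplained from the LMA viewpoint
    x2 = omp(A_lma, residual_lma, k)      # stage 2: LMA-based residue refinement
    return x1 + x2
```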

In this paper, we present an acoustic database designed to drive and support research on voice-enabled technologies inside moving vehicles. The recording process involves (i) recordings of acoustic impulse responses, acquired under static conditions to provide the means for modeling the speech and car-audio components, and (ii) recordings of acoustic noise under a wide range of static and in-motion conditions. Data are recorded with two different microphone configurations, namely (i) a compact microphone array and (ii) a distributed microphone setup. We briefly describe the conditions under which the recordings were acquired, and we provide insight into a Python API that we designed to support the research and development of voice-enabled technologies inside moving vehicles. The first version of this Python API and part of the described dataset are available for free download.
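
A typical use of such a database is sketched below under our own assumptions (file names are placeholders, and this is not the database's own Python API): convolve dry speech with a measured in-car impulse response and add recorded driving noise at a chosen SNR.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def simulate_in_car_mixture(speech_path, ir_path, noise_path, snr_db=5.0):
    """Assumes mono files at matching sample rates; paths are placeholders."""
    speech, fs = sf.read(speech_path)
    ir, _ = sf.read(ir_path)
    noise, _ = sf.read(noise_path)
    wet = fftconvolve(speech, ir)[: len(speech)]      # speech as observed at the microphone
    noise = noise[: len(wet)]
    gain = np.sqrt(np.sum(wet**2) / (np.sum(noise**2) * 10 ** (snr_db / 10)))
    return wet + gain * noise, fs
```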

Line differential microphone arrays have attracted attention for their ability to achieve frequency-invariant beampatterns and high directivity. Recently, the Jacobi-Anger expansion-based approach has enabled the design of fully steerable-invariant differential beamformers for line arrays combining omnidirectional and directional microphones. However, this approach relies on the analytical expression of the ideal beam pattern and the proper selection of the truncation order, which is not always practical. This paper introduces a null-constraint-based method for designing frequency- and steerable-invariant differential beamformers using a line array of omnidirectional and directional microphones. The approach employs a multi-constraint optimisation framework, in which the reference filter and ideal beam pattern are first determined from the specified nulls and desired direction. Subsequently, the white noise gain constraint is derived from the reference filter, and the beampattern constraint from the ideal beam pattern. The optimal filter is then obtained by considering constraints on the beampattern, the nulls, and the white noise gain. This method achieves a balance between white noise gain and mean square error, allowing robust, frequency- and steerable-invariant differential beamforming performance. It addresses limitations in beam pattern flexibility and truncation errors, offering greater design freedom and improved practical applicability. Simulations and experiments demonstrate that this method outperforms the Jacobi-Anger expansion-based approach in three key aspects: an extended effective range, improved main lobe and null alignment, and greater flexibility in microphone array configuration and beam pattern design, requiring only the steering direction and nulls instead of an analytic beam pattern expression.
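
For orientation, the classic minimum-norm, null-constrained filter that underlies this design problem can be written in a few lines: a distortionless constraint at the look direction, hard nulls at the specified directions, and the smallest-norm filter (hence the best white noise gain) satisfying them. The paper's full multi-constraint optimisation, including the constraints derived from the reference filter and ideal beam pattern, is not reproduced here.

```python
import numpy as np

def constrained_min_norm_filter(steering_look, steering_nulls):
    """steering_look: (M,) array response at the desired direction;
    steering_nulls: (M, K) responses at the K null directions."""
    D = np.column_stack([steering_look, steering_nulls])   # constraint matrix (M, K+1)
    c = np.zeros(D.shape[1], dtype=complex)
    c[0] = 1.0                                             # unit gain at the look direction
    # Minimum-norm filter h satisfying D^H h = c, i.e. h = D (D^H D)^{-1} c
    return D @ np.linalg.solve(D.conj().T @ D, c)
```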

In this paper, we introduce a neural network-based method for regional speech separation using a microphone array. This approach leverages novel spatial cues to extract the sound source not only from a specified direction but also within a defined distance. Specifically, our method employs an improved delay-and-sum technique to obtain directional cues, substantially enhancing the signal from the target direction. We further improve separation by incorporating the direct-to-reverberant ratio into the input features, enabling the model to better discriminate sources within and beyond a specified distance. Experimental results demonstrate that our proposed method leads to substantial gains across multiple objective metrics. Furthermore, it achieves state-of-the-art performance on the CHiME-8 MMCSG dataset, which was recorded in real-world conversational scenarios, underscoring its effectiveness for speech separation in practical applications.
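
A numpy sketch of a plain frequency-domain delay-and-sum cue of the sort described above, assuming a far-field model and known microphone geometry; the paper's improved delay-and-sum and DRR features are not reproduced.

```python
import numpy as np

def delay_and_sum_feature(stfts, mic_pos, freqs, target_dir, c=343.0):
    """stfts: (M, F, T) complex STFTs; mic_pos: (M, 3) in metres; freqs: (F,) in Hz;
    target_dir: unit vector (3,) pointing from the array towards the target."""
    delays = mic_pos @ target_dir / c                                   # (M,) relative delays
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])   # compensate propagation phase
    return np.mean(steering[:, :, None] * stfts, axis=0)                # (F, T) beamformed STFT
```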
