Abstract:While pre-trained models excel in specialized tasks, learning universal representations across diverse acoustic domains remains challenging. To address this, we propose WQ-Fusion, a robust dual-encoder framework for cross-domain audio representation learning. Overcoming the limitations of static concatenation, WQ-Fusion integrates Whisper and Qwen via an Adaptive Feature Modulation module and a novel element-wise gated attention mechanism. This design enables dynamic feature selection, allowing the model to selectively emphasize relevant acoustic and semantic dimensions. Extensive experiments on the Interspeech 2026 Audio Encoder Capability Challenge (Track A) benchmark demonstrate that, by effectively routing heterogeneous information, WQ-Fusion achieves an overall score of 0.836, significantly outperforming the strongest single-encoder baseline.
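To make the contrast with static concatenation concrete, the following is a minimal NumPy sketch of a generic element-wise gate between two feature streams. It is not the WQ-Fusion module itself: the names `h_a`, `h_b`, `W_g`, and `b_g` are hypothetical, and the sketch assumes both encoder outputs have already been projected to the same dimension and aligned in time.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_a, h_b, W_g, b_g):
    """Fuse two (T, D) feature streams with a per-dimension gate.

    h_a, h_b : (T, D) features from two encoders (assumed same shape)
    W_g      : (2D, D) gate projection (hypothetical parameter)
    b_g      : (D,) gate bias (hypothetical parameter)
    """
    # The gate sees both streams and outputs one value in (0, 1) per dimension.
    g = sigmoid(np.concatenate([h_a, h_b], axis=-1) @ W_g + b_g)
    # Convex combination: each dimension leans toward one stream or the other.
    return g * h_a + (1.0 - g) * h_b

rng = np.random.default_rng(0)
T, D = 4, 8
h_a = rng.standard_normal((T, D))   # stand-in for acoustic features
h_b = rng.standard_normal((T, D))   # stand-in for semantic features
W_g = 0.1 * rng.standard_normal((2 * D, D))
b_g = np.zeros(D)

fused = gated_fusion(h_a, h_b, W_g, b_g)
print(fused.shape)
```

Unlike concatenation, the gate depends on the input, so the weighting between the two streams can change from frame to frame and from dimension to dimension.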
Abstract:The minimum variance distortionless response (MVDR) beamformer is widely used for multichannel speech enhancement because it suppresses noise strongly while preserving the target signal. In practice, its performance is sensitive to microphone self-noise and array mismatches. Existing approaches typically rely on fixed, manually tuned white noise gain (WNG) thresholds or diagonal loading, leading to suboptimal performance under unknown or time-varying acoustic conditions. This paper proposes a data-driven MVDR framework that adaptively estimates the WNG constraint using a deep neural network. The network jointly predicts a time-frequency noise mask for covariance estimation and a frequency-dependent WNG threshold, enabling dynamic robustness-directivity control. A differentiable robust MVDR layer is integrated into the framework, allowing end-to-end optimization. Experiments demonstrate consistent improvements in speech quality and intelligibility over conventional fixed-WNG MVDR methods.
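The robustness-directivity trade-off mentioned above can be illustrated with the conventional fixed approach that the abstract contrasts itself with. The sketch below computes narrowband MVDR weights with a fixed diagonal loading and evaluates the resulting WNG. The array geometry, the interferer, and the loading value are assumptions chosen for illustration only; the learned WNG threshold and the differentiable layer of the paper are not reproduced here.

```python
import numpy as np

def mvdr_weights(R, d, loading=0.0):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) with diagonal loading."""
    M = R.shape[0]
    Rinv_d = np.linalg.solve(R + loading * np.eye(M), d)
    return Rinv_d / (d.conj() @ Rinv_d)

def white_noise_gain(w, d):
    """WNG = |w^H d|^2 / (w^H w); it is upper-bounded by the number of mics."""
    return np.abs(w.conj() @ d) ** 2 / np.real(w.conj() @ w)

def steering(theta, M):
    """Far-field steering vector of a half-wavelength-spaced linear array (assumed)."""
    return np.exp(-1j * np.pi * np.arange(M) * np.cos(theta))

M = 4
d = steering(np.deg2rad(90.0), M)        # target direction (assumed)
d_int = steering(np.deg2rad(70.0), M)    # interferer direction (assumed)
# Noise covariance: one strong interferer plus weak sensor self-noise.
R = 10.0 * np.outer(d_int, d_int.conj()) + 1e-3 * np.eye(M)

w_plain = mvdr_weights(R, d)
w_loaded = mvdr_weights(R, d, loading=1.0)

for name, w in [("no loading", w_plain), ("loading=1.0", w_loaded)]:
    wng_db = 10.0 * np.log10(white_noise_gain(w, d))
    print(f"{name}: WNG = {wng_db:.2f} dB")
```

Increasing the loading raises the WNG (more robustness to self-noise and mismatch) at the cost of weaker interference suppression, which is why a single hand-tuned value is a compromise across conditions.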
Abstract:Accurate phoneme recognition is pivotal for mispronunciation detection and diagnosis (MDD) in Modern Standard Arabic (MSA), yet remains constrained by data scarcity and the synthetic-real domain gap. This work proposes a two-stage end-to-end framework. It integrates a pre-trained encoder with causal dilated temporal convolutional networks to preserve fine-grained phonetic variations. A hierarchical two-stage strategy first learns general mappings from native/synthetic corpora, then adapts to scarce real learner data to mitigate domain shift without over-correction. Prediction stability is further enhanced via multi-checkpoint ensemble inference with N-gram rescoring. Evaluated on the QuranMB.v2 test set, our system achieves an F1-score of $0.7201$, a $63.1$\% relative improvement over the baseline ($0.4414$). This performance ranks at the top of the IqraEval.2 Challenge, establishing a new state of the art for low-resource MSA in MDD.
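As a quick check of the reported figure, the relative improvement follows directly from the two F1-scores quoted above:

\[
\frac{0.7201 - 0.4414}{0.4414} = \frac{0.2787}{0.4414} \approx 0.631,
\]

that is, about $63.1$\% relative to the baseline.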
Abstract:Differential microphone arrays offer a promising solution for far-field acoustic signal acquisition due to their high spatial directivity and compact array structure. A key challenge lies in designing differential beamformers that are continuously steerable and capable of enhancing target signals arriving from arbitrary directions. This paper studies the design of differential beamformers for circular arrays and proposes a novel framework that incorporates directional derivative constraints. By constraining the first-order derivative of the beampattern at the desired steering direction to zero and assigning suitable values to the higher-order derivatives, the beamformer is guaranteed to achieve its maximum response in the target direction and to provide sufficient beam steering. This approach not only improves steering flexibility but also enables a more intuitive and robust beampattern design. Simulation results demonstrate that the proposed method produces continuously steerable beampatterns.
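The role of the derivative constraints can be illustrated on an ideal target beampattern. The sketch below assumes a pattern of the form B(theta) = sum_n a_n cos(n (theta - theta_s)), with second-order cardioid coefficients chosen for illustration; it is not the paper's design procedure, only a numerical check that such a pattern has a zero first derivative, a negative second derivative, and its maximum at the steering direction.

```python
import numpy as np

def beampattern(theta, theta_s, a):
    """Ideal steerable pattern B = sum_n a[n] * cos(n * (theta - theta_s))."""
    n = np.arange(len(a))
    return np.cos(np.outer(theta - theta_s, n)) @ a

def first_derivative(theta, theta_s, a):
    """dB/dtheta = -sum_n n * a[n] * sin(n * (theta - theta_s))."""
    n = np.arange(len(a))
    return -np.sin(np.outer(theta - theta_s, n)) @ (n * a)

def second_derivative(theta, theta_s, a):
    """d^2B/dtheta^2 = -sum_n n^2 * a[n] * cos(n * (theta - theta_s))."""
    n = np.arange(len(a))
    return -np.cos(np.outer(theta - theta_s, n)) @ (n ** 2 * a)

# Second-order cardioid, ((1 + cos)/2)^2, written as a cosine series (assumed example).
a = np.array([0.375, 0.5, 0.125])
theta_s = np.deg2rad(40.0)                      # arbitrary steering direction
theta = np.linspace(0.0, 2.0 * np.pi, 3601)     # 0.1 degree grid

B = beampattern(theta, theta_s, a)
peak_angle = theta[np.argmax(B)]
d1 = first_derivative(np.array([theta_s]), theta_s, a)[0]
d2 = second_derivative(np.array([theta_s]), theta_s, a)[0]

print(np.rad2deg(peak_angle), d1, d2)
```

Because the steering angle enters only through the argument of the cosines, changing `theta_s` rotates the same pattern to any direction, which is the continuous steerability the abstract refers to.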
Abstract:Online blind source separation is essential for both speech communication and human-machine interaction. Among existing approaches, overdetermined independent vector analysis (OverIVA) delivers strong performance by exploiting the statistical independence of source signals and the orthogonality between source and noise subspaces. However, when applied to large microphone arrays, the number of parameters grows rapidly, which can degrade online estimation accuracy. To overcome this challenge, we propose decomposing each long separation filter into a bilinear form of two shorter filters, thereby reducing the number of parameters. Because the two filters are closely coupled, we design an alternating iterative projection algorithm to update them in turn. Simulation results show that, with far fewer parameters, the proposed method achieves improved performance and robustness.
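The parameter saving from such a decomposition can be sketched as follows, assuming the bilinear form corresponds to a Kronecker product of two short filters (an assumption made for illustration; the abstract does not spell out the exact structure). The names `w1`, `w2`, and the sizes `M1`, `M2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M1, M2 = 4, 8                      # short filter lengths; long filter has M1 * M2 taps

def crandn(*shape):
    """Complex Gaussian helper for the toy data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

w1, w2 = crandn(M1), crandn(M2)
x = crandn(M1 * M2)                # one multichannel observation (toy data)

# Long-filter view: w = w1 (Kronecker) w2, output y = w^H x.
w_long = np.kron(w1, w2)
y_long = np.vdot(w_long, x)        # vdot conjugates its first argument

# Bilinear view: reshape x into an (M1, M2) matrix and apply both short filters.
X = x.reshape(M1, M2)
y_bilinear = w1.conj() @ X @ w2.conj()

# With w2 fixed, the output is linear in w1 (and vice versa),
# which is what makes alternating updates of the two filters natural.
x_reduced = X @ w2.conj()          # length-M1 "compressed" observation
y_alt = np.vdot(w1, x_reduced)

print(np.allclose(y_long, y_bilinear), np.allclose(y_long, y_alt))
print("parameters:", M1 * M2, "vs", M1 + M2)
```

For these sizes the number of free coefficients drops from 32 to 12, and the gap widens as the array grows.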
Abstract:In this work, we propose a method to efficiently find the regularization parameter for low-rank MMSE filters based on a Kronecker-product representation. We show that the regularization parameter is closely linked to the problem of rank selection and that choosing it properly is therefore crucial in low-rank settings. The proposed method is validated through simulations, showing significant gains over commonly used methods.
Abstract:The performance of deep learning-based multi-channel speech enhancement methods often deteriorates when the geometric parameters of the microphone array change. Traditional approaches to mitigate this issue typically involve training on multiple microphone arrays, which can be costly. To address this challenge, we focus on uniform circular arrays and propose the use of a spatial filter bank to extract features that are approximately invariant to geometric parameters. These features are then processed by a two-stage conformer-based model (TSCBM) to enhance speech quality. Experimental results demonstrate that the proposed method can be trained on a single fixed microphone array while remaining effective, at deployment, on uniform circular arrays with unseen geometric configurations.
Abstract:In this work, we present a new perspective on the origin and interpretation of adaptive filters. By applying Bayesian principles of recursive inference to the state-space model and using a series of simplifications regarding the structure of the solution, we present, in a unified framework, derivations of many adaptive filters, which depend on the probabilistic model of the observational noise. In particular, under a Gaussian model, we obtain solutions well known in the literature (such as the LMS, NLMS, or Kalman filter), while under non-Gaussian noise models, we obtain new families of adaptive filters. Notably, under the assumption of Laplacian noise, we obtain a family of robust filters of which the signed-error algorithm is a well-known member, while other algorithms, derived easily in the proposed framework, are entirely new. Numerical examples are shown to illustrate the properties and provide better insight into the performance of the derived adaptive filters.
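For reference, the three classical algorithms named above differ only in how the error enters the coefficient update. The sketch below shows their textbook update rules on a toy system-identification problem; it does not reproduce the Bayesian derivation of the paper, and the step sizes, filter length, and noise level are assumptions chosen for illustration.

```python
import numpy as np

def lms_step(w, x, d, mu):
    """LMS: step proportional to the error."""
    e = d - w @ x
    return w + mu * e * x

def nlms_step(w, x, d, mu, eps=1e-8):
    """NLMS: LMS step normalized by the input energy."""
    e = d - w @ x
    return w + mu * e * x / (x @ x + eps)

def sign_error_step(w, x, d, mu):
    """Signed-error: only the sign of the error is used, limiting outlier impact."""
    e = d - w @ x
    return w + mu * np.sign(e) * x

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2, 0.1])   # unknown system (toy example)
L = len(w_true)

estimates = {"lms": np.zeros(L), "nlms": np.zeros(L), "sign": np.zeros(L)}
for _ in range(3000):
    x = rng.standard_normal(L)
    d = w_true @ x + 0.01 * rng.standard_normal()   # noisy observation
    estimates["lms"] = lms_step(estimates["lms"], x, d, mu=0.05)
    estimates["nlms"] = nlms_step(estimates["nlms"], x, d, mu=0.5)
    estimates["sign"] = sign_error_step(estimates["sign"], x, d, mu=0.005)

errors = {k: float(np.linalg.norm(v - w_true)) for k, v in estimates.items()}
print(errors)
```

The signed-error update bounds the contribution of any single sample, which is consistent with its interpretation as the robust choice under heavy-tailed (Laplacian) observation noise.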
Abstract:This paper reviews pioneering works in microphone array processing and multichannel speech enhancement, highlighting historical achievements, technological evolution, commercialization aspects, and key challenges. It provides valuable insights into the progression and future direction of these areas. The paper examines foundational developments in microphone array design and optimization, showcasing innovations that improved sound acquisition and enhanced speech intelligibility in noisy and reverberant environments. It then introduces recent advancements and cutting-edge research in the field, particularly the integration of deep learning techniques such as all-neural beamformers. The paper also explores critical applications, discussing their evolution and current state-of-the-art technologies that significantly impact user experience. Finally, the paper outlines future research directions, identifying challenges and potential solutions that could drive further innovation in these fields. By providing a comprehensive overview and forward-looking perspective, this paper aims to inspire ongoing research and contribute to the sustained growth and development of microphone arrays and multichannel speech enhancement.


Abstract:Independent low-rank matrix analysis (ILRMA) has demonstrated great potential for determined blind source separation (BSS) of audio and speech signals. This method assumes that the spectra from different frequency bands are independent and that the spectral coefficients in any frequency band are Gaussian distributed. The Itakura-Saito divergence is then employed to estimate the source-model parameters. In reality, however, the spectral coefficients from different frequency bands may be dependent, which is not considered in the existing ILRMA algorithm. This paper presents an improved version of ILRMA that accounts for the dependency between the spectral coefficients from different frequency bands. The Sinkhorn divergence is then exploited to optimize the source model parameters. Using this cross-band information improves the BSS performance. However, the number of parameters to be estimated increases significantly, and so does the computational complexity. To reduce the complexity, we apply the Kronecker product to decompose the modeling matrix into the product of a number of matrices of much smaller dimensionality. An efficient algorithm is then developed to implement the Sinkhorn-divergence-based BSS algorithm, and the complexity is reduced by an order of magnitude.
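For readers unfamiliar with the baseline cost, the Itakura-Saito divergence used in standard ILRMA can be written in a few lines. The sketch below is a generic implementation for nonnegative power spectra, with a small `eps` added as an assumed numerical safeguard; the Sinkhorn-divergence variant proposed in the paper is not reproduced here.

```python
import numpy as np

def itakura_saito(p, q, eps=1e-12):
    """Itakura-Saito divergence: sum over bins of p/q - log(p/q) - 1."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    r = p / q
    return float(np.sum(r - np.log(r) - 1.0))

observed = np.array([1.0, 4.0, 0.5])   # toy power spectrum
model = np.array([1.2, 3.0, 0.7])      # toy model variances

print(itakura_saito(observed, model))
print(itakura_saito(10.0 * observed, 10.0 * model))   # same value up to eps: scale invariant
print(itakura_saito(model, observed))                 # differs: not symmetric
```

The divergence is computed independently per time-frequency bin and then summed, which reflects the band-independence assumption that the abstract sets out to relax.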