Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mikko Heikkinen

Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention

Jan 30, 2026

Mikko Heikkinen, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Abstract:We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.

* Accepted to ICASSP 2026

Via

Access Paper or Ask Questions

Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns

Sep 05, 2025

Konstantinos Drossos, Mikko Heikkinen, Paschalis Tsiaflakis

Abstract:Speech denoising (SD) is an important task of many, if not all, modern signal processing chains used in devices and for everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD are not focusing on full-band (FB) signals, i.e. having 48 kHz sampling rate, and/or low latency cases. In this paper we present a causal, low latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and estimated denoising mask. The DNN operates on a causal frame-by-frame basis taking as an input the STFT magnitude, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low latency SD methods.

* Accepted for publication in Proceedings of the 2025 IEEE 27th International Workshop on Multimedia Signal Processing (MMSP)

Via

Access Paper or Ask Questions

Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

May 06, 2025

Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen

Figure 1 for Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Figure 2 for Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Figure 3 for Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Figure 4 for Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Abstract:Speech denoising is a generally adopted and impactful task, appearing in many common and everyday-life use cases. Although there are very powerful methods published, most of those are too complex for deployment in everyday and low-resources computational environments, like hand-held devices, intelligent glasses, hearing aids, etc. Knowledge distillation (KD) is a prominent way for alleviating this complexity mismatch and is based on the transferring/distilling of knowledge from a pre-trained complex model, the teacher, to another less complex one, the student. Existing KD methods for speech denoising are based on processes that potentially hamper the KD by bounding the learning of the student to the distribution, information ordering, and feature dimensionality learned by the teacher. In this paper, we present and assess a method that tries to treat this issue, by exploiting the well-known denoising-autoencoder framework, the linear inverted bottlenecks, and the properties of the cosine similarity. We use a public dataset and conduct repeated experiments with different mismatching scenarios between the teacher and the student, reporting the mean and standard deviation of the metrics of our method and another, state-of-the-art method that is used as a baseline. Our results show that with the proposed method, the student can perform better and can also retain greater mismatching conditions compared to the teacher.

Via

Access Paper or Ask Questions

Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

Jan 14, 2025

Mikko Heikkinen, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Figure 1 for Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

Figure 2 for Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

Figure 3 for Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

Abstract:Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.

* Accepted for publication in Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing

Via

Access Paper or Ask Questions

Neural Ambisonics encoding for compact irregular microphone arrays

Jan 11, 2024

Mikko Heikkinen, Archontis Politis, Tuomas Virtanen

Figure 1 for Neural Ambisonics encoding for compact irregular microphone arrays

Figure 2 for Neural Ambisonics encoding for compact irregular microphone arrays

Abstract:Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays. This paper proposes a method for Ambisonics encoding that uses a deep neural network (DNN) to estimate a signal transform from microphone inputs to Ambisonics signals. The approach uses a DNN consisting of a U-Net structure with a learnable preprocessing as well as a loss function consisting of mean average error, spatial correlation, and energy preservation components. The method is validated on two microphone arrays with regular and irregular shapes having four microphones, on simulated reverberant scenes with multiple sources. The results of the validation show that the proposed method can meet or exceed the performance of a conventional signal-independent Ambisonics encoder on a number of error metrics.

* Accepted for publication in Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing

Via

Access Paper or Ask Questions