Tuomas Virtanen

Tampere University

Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention

Jan 30, 2026

Mikko Heikkinen, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Abstract:We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.

* Accepted to ICASSP 2026
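
As a rough illustration of the cross-attention fusion mentioned in the abstract above, the following NumPy sketch lets audio-frame features attend over encoded directional responses. Using audio features as queries and directional features as keys and values is an assumption, as are all dimensions and the random projection matrices; the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio_feats, atf_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical layout).

    audio_feats: (T, d_a) encoded audio frames, used as queries (assumption).
    atf_feats:   (D, d_r) encoded directional responses, used as keys/values (assumption).
    """
    q = audio_feats @ w_q                      # (T, d)
    k = atf_feats @ w_k                        # (D, d)
    v = atf_feats @ w_v                        # (D, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, D) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
T, D, d_a, d_r, d = 6, 10, 8, 12, 16
audio_feats = rng.standard_normal((T, d_a))
atf_feats = rng.standard_normal((D, d_r))
w_q = rng.standard_normal((d_a, d))
w_k = rng.standard_normal((d_r, d))
w_v = rng.standard_normal((d_r, d))

fused, weights = cross_attention(audio_feats, atf_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each output row is a weighted mix of directional-response features, so the fused representation depends on the array description rather than on a fixed channel ordering.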

Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers

Jan 06, 2026

Mikhail Silaev, Konstantinos Drossos, Tuomas Virtanen

Abstract:Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band ($4\to 16$ kHz) and full-band ($16\to 48$ kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.

* Accepted for publication in Workshop Proceedings of the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing
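
The linear-probe idea in the abstract above can be sketched as follows. This is a minimal example under stated assumptions: the "embeddings" are random Gaussian vectors standing in for real audio embeddings, and the logistic-regression trainer, learning rate, and sizes are hypothetical choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 16

def sample(n_per_class):
    # Stand-in embeddings: two Gaussian clouds with shifted means (assumption).
    real = rng.standard_normal((n_per_class, dim)) + 0.5
    synthetic = rng.standard_normal((n_per_class, dim)) - 0.5
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y

def train_linear_classifier(X, y, lr=0.1, steps=300):
    # Plain logistic regression fitted with full-batch gradient descent.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

X_train, y_train = sample(n)
X_test, y_test = sample(n)
w, b = train_linear_classifier(X_train, y_train)

# Held-out accuracy: high values mean the two sets are linearly separable.
accuracy = np.mean(((X_test @ w + b) > 0) == (y_test == 1))
print(f"held-out accuracy: {accuracy:.3f}")
```

A held-out accuracy close to 1 would indicate that the two sets occupy distinguishable regions of the embedding space, which is the kind of evidence the abstract describes.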

Acoustic Simulation Framework for Multi-channel Replay Speech Detection

Sep 18, 2025

Michael Neri, Tuomas Virtanen

Abstract:Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can enhance replay detection robustness, existing datasets and methods predominantly rely on single-channel recordings. In this work, we introduce an acoustic simulation framework designed to simulate multi-channel replay speech configurations using publicly available resources. Our setup models both genuine and spoofed speech across varied environments, including realistic microphone and loudspeaker impulse responses, room acoustics, and noise conditions. The framework employs measured loudspeaker directionalities during the replay attack to improve the realism of the simulation. We define two spoofing settings, which differ in whether reverberant or anechoic speech is used in the replay scenario, and evaluate the impact of omnidirectional and diffuse noise on detection performance. Using the state-of-the-art M-ALRAD model for replay speech detection, we demonstrate that synthetic data can support the generalization capabilities of the detector across unseen enclosures.

* Submitted to ICASSP 2026

Multi-Utterance Speech Separation and Association Trained on Short Segments

Jul 03, 2025

Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Abstract:Current deep neural network (DNN) based speech separation faces a fundamental challenge: while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings, with more utterances per speaker, than seen during training. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.

* 5 pages, accepted by WASPAA 2025

Hybrid Disagreement-Diversity Active Learning for Bioacoustic Sound Event Detection

May 28, 2025

Shiqi Zhang, Tuomas Virtanen

Abstract:Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method integrating committee voting disagreement and diversity analysis. We also refine an existing BioSED dataset specifically for evaluating active learning algorithms. Experimental results demonstrate that MFFT achieves a mAP of 68% when cold-starting and 71% when warm-starting (which is close to the fully-supervised mAP of 75%) while using only 2.3% of the annotations. Notably, MFFT excels in cold-start scenarios and with rare species, which are critical for monitoring endangered species, demonstrating its practical value.

* 5 pages, 1 figure, accepted by EUSIPCO 2025; v2: added our GitHub repo
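
One plausible reading of a "mismatch-first, farthest-traversal" selection rule is sketched below: candidates flagged as disagreements are considered first, and within the active pool the sample farthest from everything already chosen is picked. The function name, the boolean `mismatch` flags (standing in for committee disagreement), and the toy 1-D features are assumptions for illustration; the abstract does not give the exact procedure.

```python
import numpy as np

def mismatch_first_farthest_traversal(features, mismatch, labeled_idx, budget):
    """Pick up to `budget` indices (hypothetical sketch, not the paper's code).

    features:    (N, d) array of sample embeddings.
    mismatch:    (N,) booleans, True where the committee disagrees (assumption).
    labeled_idx: indices that are already annotated.
    """
    chosen = list(labeled_idx)
    candidates = sorted(set(range(len(features))) - set(chosen))
    picked = []
    for _ in range(budget):
        if not candidates:
            break
        # Mismatch first: fall back to all candidates once no mismatches remain.
        pool = [i for i in candidates if mismatch[i]] or candidates
        if chosen:
            # Farthest traversal: maximize distance to the nearest chosen sample.
            diffs = features[pool][:, None, :] - features[chosen][None, :, :]
            nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
            pick = pool[int(np.argmax(nearest))]
        else:
            pick = pool[0]
        picked.append(pick)
        chosen.append(pick)
        candidates.remove(pick)
    return picked

features = np.array([[0.0], [1.0], [2.0], [10.0], [14.0]])
mismatch = np.array([False, True, False, True, False])
selection = mismatch_first_farthest_traversal(features, mismatch, labeled_idx=[0], budget=3)
print(selection)  # [3, 1, 4]
```

In the toy run, both flagged samples (indices 3 and 1) are selected before any unflagged one, and the last pick is the unflagged sample farthest from the chosen set.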

Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers

May 22, 2025

Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

Abstract:This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining Librispeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources. The system effectively detects source activities and separates the corresponding utterances into correct outputs in both known and unknown source count scenarios.

* 5 pages, 4 figures, accepted by Interspeech 2025

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

May 20, 2025

Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

Abstract:This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all the modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval, highlighting the advantages of unified multimodal representation learning.

* Accepted to European Signal Processing Conference (EUSIPCO 2025)
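
To make the single-stage idea above concrete, here is a minimal NumPy sketch of a joint contrastive objective over three modalities. Summing symmetric InfoNCE-style losses over the three modality pairs, the temperature of 0.07, and the random embeddings are all assumptions; the abstract does not state the exact loss.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def pair_loss(a, b, temperature=0.07):
    # Symmetric contrastive loss: row i of `a` should match row i of `b`.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    diag = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)

def trimodal_loss(audio, visual, text, temperature=0.07):
    # One objective covering all three pairs, optimized jointly (assumption).
    return (pair_loss(audio, visual, temperature)
            + pair_loss(audio, text, temperature)
            + pair_loss(visual, text, temperature))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))  # toy batch of 8 clips, 32-dim embeddings

aligned = trimodal_loss(z, z, z)
# Misalign the audio rows with respect to the other two modalities.
shuffled = trimodal_loss(np.roll(z, 1, axis=0), z, z)
print(f"aligned: {aligned:.3f}, shuffled: {shuffled:.3f}")
```

The loss is lower when matching clips share similar embeddings across all three modalities, which is the behavior a jointly trained model would be pushed toward.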

Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

May 06, 2025

Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen

Abstract:Speech denoising is a widely adopted and impactful task, appearing in many common, everyday use cases. Although very powerful methods have been published, most of them are too complex for deployment in everyday, low-resource computational environments such as hand-held devices, intelligent glasses, and hearing aids. Knowledge distillation (KD) is a prominent way of alleviating this complexity mismatch and is based on transferring (distilling) knowledge from a pre-trained complex model, the teacher, to a less complex one, the student. Existing KD methods for speech denoising are based on processes that potentially hamper the KD by bounding the learning of the student to the distribution, information ordering, and feature dimensionality learned by the teacher. In this paper, we present and assess a method that addresses this issue by exploiting the well-known denoising-autoencoder framework, linear inverted bottlenecks, and the properties of the cosine similarity. We use a public dataset and conduct repeated experiments with different mismatching scenarios between the teacher and the student, reporting the mean and standard deviation of the metrics of our method and of a state-of-the-art method that is used as a baseline. Our results show that with the proposed method, the student can perform better and can also tolerate greater mismatch with respect to the teacher.
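
The property of the cosine similarity that makes it attractive for latent alignment can be shown with a short NumPy sketch: the distance ignores vector magnitude and only compares direction. The function below assumes teacher and student latents of equal dimensionality; how the paper handles differing dimensionalities is not reproduced here, and the example vectors are made up for illustration.

```python
import numpy as np

def cosine_distance(student_latent, teacher_latent, eps=1e-8):
    # 1 - cosine similarity, computed row by row.
    s = student_latent / (np.linalg.norm(student_latent, axis=-1, keepdims=True) + eps)
    t = teacher_latent / (np.linalg.norm(teacher_latent, axis=-1, keepdims=True) + eps)
    return 1.0 - (s * t).sum(axis=-1)

teacher = np.array([[1.0, 2.0, 3.0],
                    [1.0, 0.0, 0.0]])
student = np.array([[3.0, 6.0, 9.0],   # same direction as the teacher, larger scale
                    [0.0, 1.0, 0.0]])  # orthogonal to the teacher

distances = cosine_distance(student, teacher)
print(distances)  # approximately [0, 1]
```

Because a rescaled latent incurs no penalty, the student is not forced to reproduce the teacher's activation magnitudes, only their direction.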

Score-informed Music Source Separation: Improving Synthetic-to-real Generalization in Classical Music

Mar 10, 2025

Eetu Tunturi, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

Abstract:Music source separation is the task of separating a mixture of instruments into constituent tracks. Music source separation models are typically trained using only audio data, although additional information can be used to improve the model's separation capability. In this paper, we propose two ways of using musical scores to aid music source separation: a score-informed model where the score is concatenated with the magnitude spectrogram of the audio mixture as the input of the model, and a model where we use only the score to calculate the separation mask. We train our models on synthetic data in the SynthSOD dataset and evaluate our methods on the URMP and Aalto anechoic orchestra datasets, comprised of real recordings. The score-informed model improves separation results compared to a baseline approach, but struggles to generalize from synthetic to real data, whereas the score-only model shows a clear improvement in synthetic-to-real generalization.

* 5 pages, 2 figures, submitted to EUSIPCO 2025

Impact of Microphone Array Mismatches to Learning-based Replay Speech Detection

Mar 10, 2025

Michael Neri, Tuomas Virtanen

Abstract:In this work, we investigate the generalization of a multi-channel learning-based replay speech detector, which employs adaptive beamforming and detection, across different microphone arrays. In general, deep neural network-based microphone array processing techniques generalize poorly to unseen array types, i.e., showing a significant training-test mismatch of performance. We employ the ReMASC dataset to analyze performance degradation due to inter- and intra-device mismatches, assessing both single- and multi-channel configurations. Furthermore, we explore fine-tuning to mitigate the performance loss when transitioning to unseen microphone arrays. Our findings reveal that array mismatches significantly decrease detection accuracy, with intra-device generalization being more robust than inter-device. However, fine-tuning with as little as ten minutes of target data can effectively recover performance, providing insights for practical deployment of replay detection systems in heterogeneous automatic speaker verification environments.

* Submitted to EUSIPCO 2025
