Topic:Audio Super Resolution

What is Audio Super Resolution? Audio super resolution is the process of enhancing audio quality by increasing a signal's sampling rate and reconstructing the high-frequency content that the low-rate version lacks.
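
As a rough illustration of this definition: a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist limit), so a super-resolution model has to generate the band between the input and target Nyquist limits. The sketch below computes that band; the function name `missing_band_hz` is hypothetical, and the example rates are the 4 to 16 kHz and 16 to 48 kHz tasks mentioned in the first abstract on this page.

```python
from typing import Tuple


def missing_band_hz(input_rate_hz: float, target_rate_hz: float) -> Tuple[float, float]:
    """Return the (low, high) band in Hz that is absent from the input.

    A signal sampled at rate R holds content only up to R / 2 (Nyquist),
    so upsampling from input_rate_hz to target_rate_hz leaves the band
    between the two Nyquist limits to be filled in.
    """
    if target_rate_hz <= input_rate_hz:
        raise ValueError("target rate must exceed input rate")
    return input_rate_hz / 2, target_rate_hz / 2


# Example: the two upsampling tasks named in the first abstract.
mid_band = missing_band_hz(4_000, 16_000)    # band to generate for 4 -> 16 kHz
full_band = missing_band_hz(16_000, 48_000)  # band to generate for 16 -> 48 kHz
```

For the 4 to 16 kHz task, everything between 2 kHz and 8 kHz must be synthesized, which is why the problem is usually described as ill-posed.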

Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers

Jan 06, 2026

Mikhail Silaev, Konstantinos Drossos, Tuomas Virtanen

Abstract:Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4 → 16 kHz) and full-band (16 → 48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
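
The linear-classifier probe described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' setup: the "embeddings" are random toy vectors (an assumption standing in for real audio embeddings), the classifier is a plain logistic regression trained by gradient descent, and all names (`train_linear_probe`, `toy_embeddings`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, epochs=300, lr=0.1):
    """Fit logistic regression (weights w, bias b) by batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples on the correct side of the linear boundary."""
    hits = 0
    for x, t in zip(X, y):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == t)
    return hits / len(X)


def toy_embeddings(n, dim, shift, rng):
    # ASSUMPTION: Gaussian toy vectors, not embeddings of real audio.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = toy_embeddings(200, 4, 0.0, rng)       # label 0: "real wideband"
synthetic = toy_embeddings(200, 4, 2.0, rng)  # label 1: "super-resolved"
X = real + synthetic
y = [0] * len(real) + [1] * len(synthetic)

w, b = train_linear_probe(X, y)
probe_accuracy = accuracy(w, b, X, y)
```

If a linear boundary separates the two sets with high accuracy, the two distributions differ in that embedding space, regardless of how similar the underlying audio sounds. In practice accuracy would be measured on held-out data rather than on the training set used here.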

* Accepted for publication in Workshop Proceedings of the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing

HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Nov 15, 2025

Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li

Abstract:Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

* Accepted by AAAI 2026 main technical track

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

Oct 01, 2025

Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang

Abstract:In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
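
To make the vocoder-free idea concrete, the sketch below shows that complex STFT coefficients alone are enough to recover a waveform through the inverse STFT. It is only a toy round trip under stated assumptions: the coefficients come from the signal itself rather than from a generative model, the frame length, hop size, and Hann window are arbitrary choices, and the helper names (`dft`, `stft`, `istft`) are hypothetical rather than taken from the paper.

```python
import cmath
import math


def dft(values, inverse=False):
    # Naive O(n^2) discrete Fourier transform; fine for a tiny demo.
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def hann(n):
    # Periodic Hann window.
    return [math.sin(math.pi * i / n) ** 2 for i in range(n)]


def stft(signal, frame_len=8, hop=4):
    """Windowed frames -> list of complex spectra (zero-padded at both ends)."""
    window, pad = hann(frame_len), frame_len // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    frames = []
    for start in range(0, len(padded) - frame_len + 1, hop):
        frames.append(dft([padded[start + i] * window[i] for i in range(frame_len)]))
    return frames


def istft(frames, length, frame_len=8, hop=4):
    """Overlap-add the inverse DFT of each frame, normalised by window energy."""
    window, pad = hann(frame_len), frame_len // 2
    total = hop * (len(frames) - 1) + frame_len
    out, norm = [0.0] * total, [0.0] * total
    for index, spectrum in enumerate(frames):
        segment = dft(spectrum, inverse=True)
        for i in range(frame_len):
            out[index * hop + i] += segment[i].real * window[i]
            norm[index * hop + i] += window[i] ** 2
    restored = [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]
    return restored[pad:pad + length]


# Round trip on a short sine wave (length chosen as a multiple of the hop).
signal = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
coefficients = stft(signal)
reconstructed = istft(coefficients, len(signal))
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

Because the inverse transform is a fixed, exact operation, a model that predicts the complex coefficients directly has no separate vocoder stage whose quality could cap the final output.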

* Submitted to ICASSP 2026

Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Oct 23, 2025

Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Abstract:Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

Ambisonics Super-Resolution Using A Waveform-Domain Neural Network

Aug 01, 2025

Ismael Nawfal, Symeon Delikaris Manias, Mehrez Souden, Juha Merimaa, Joshua Atkins, Elisabeth McMullin, Shadi Pirhosseinloo, Daniel Phillips

Abstract:Ambisonics is a spatial audio format describing a sound field. First-order Ambisonics (FOA) is a popular format comprising only four channels. This limited channel count comes at the expense of spatial accuracy. Ideally one would be able to take the efficiency of a FOA format without its limitations. We have devised a data-driven spatial audio solution that retains the efficiency of the FOA format but achieves quality that surpasses conventional renderers. Utilizing a fully convolutional time-domain audio neural network (Conv-TasNet), we created a solution that takes a FOA input and provides a higher-order Ambisonics (HOA) output. This data-driven approach is novel when compared to typical physics- and psychoacoustics-based renderers. Quantitative evaluations showed a 0.6 dB average positional mean squared error difference between predicted and actual 3rd-order HOA. The median qualitative rating showed an 80% improvement in perceived quality over the traditional rendering approach.
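
The size of the upmixing task in this abstract follows from the standard channel count of full-sphere Ambisonics, which is (N + 1)^2 for order N. The small sketch below applies that formula; the function name `ambisonic_channel_count` is hypothetical and not taken from the paper.

```python
def ambisonic_channel_count(order: int) -> int:
    """Number of channels in full-sphere Ambisonics of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


foa_channels = ambisonic_channel_count(1)  # first-order input
hoa_channels = ambisonic_channel_count(3)  # third-order output
```

Under this formula, going from first order to third order means predicting a 16-channel signal from a 4-channel one, which is the sense in which the task resembles super-resolution along the spatial dimension.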

Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

Aug 06, 2025

Yuqin Cao, Yixuan Gao, Wei Sun, Xiaohong Liu, Yulun Zhang, Xiongkuo Min

Abstract:Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.

ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment

Jun 24, 2025

Shengkui Zhao, Zexu Pan, Bin Ma

Abstract:This paper introduces ClearerVoice-Studio, an open-source, AI-powered speech processing toolkit designed to bridge cutting-edge research and practical application. Unlike broad platforms like SpeechBrain and ESPnet, ClearerVoice-Studio focuses on interconnected speech tasks of speech enhancement, separation, super-resolution, and multimodal target speaker extraction. A key advantage is its state-of-the-art pretrained models, including FRCRN with 3 million uses and MossFormer with 2.5 million uses, optimized for real-world scenarios. It also offers model optimization tools, multi-format audio support, the SpeechScore evaluation toolkit, and user-friendly interfaces, catering to researchers, developers, and end-users. Its rapid adoption, attracting 3000 GitHub stars and 239 forks, highlights its academic and industrial impact. This paper details ClearerVoice-Studio's capabilities, architectures, training strategies, benchmarks, community impact, and future plans. Source code is available at https://github.com/modelscope/ClearerVoice-Studio.

* Accepted by Interspeech 2025, 5 pages, 5 tables

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System

Mar 26, 2025

Hao-Han Guo, Kun Xie, Yi-Chen Wu, Feng-Long Xie

Abstract:In this work, we propose a high-quality streaming foundation text-to-speech system, FireRedTTS-1S, upgraded from the streamable version of FireRedTTS. FireRedTTS-1S achieves streaming generation via two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from the text via a semantic language model in an auto-regressive manner. Meanwhile, the semantic-to-acoustic decoding module simultaneously translates generated semantic tokens into the speech signal in a streaming way via a super-resolution causal audio codec and a multi-stream acoustic language model. This design enables us to produce high-quality speech audio in zero-shot settings while presenting a real-time generation process with low latency under 150ms. In experiments on zero-shot voice cloning, the objective results validate FireRedTTS-1S as a high-quality foundation model with comparable intelligibility and speaker similarity over industrial baseline systems. Furthermore, the subjective score of FireRedTTS-1S highlights its impressive synthesis performance, achieving comparable quality to the ground-truth recordings. These results validate FireRedTTS-1S as a high-quality streaming foundation TTS system.

FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation

Jan 18, 2025

Jaekwon Im, Juhan Nam

Abstract:Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4kHz and 32kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.

* 4 pages, 3 figures

FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

Jan 09, 2025

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

Abstract:Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL while maintaining computational efficiency with only a single-step sampling process.
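
One way to see why flow matching can permit single-step sampling is the toy sketch below. It assumes a straight-line probability path between a start point and a target, which is a common choice but not necessarily the tailored path the paper proposes (the abstract does not specify it). The trained network is replaced by an oracle that returns the exact regression target, and every name here (`interpolate`, `target_velocity`, `euler_sample`, `oracle_velocity`) is hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight-line path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Time derivative of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0: Vector, velocity_fn: Callable[[Vector, float], Vector],
                 steps: int = 1) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy vectors standing in for the path endpoints (ASSUMPTION: not real audio).
start = [0.0, 0.5, -0.5]
target = [1.0, -1.0, 0.25]


def oracle_velocity(x: Vector, t: float) -> Vector:
    # Stand-in for a trained model: returns the exact training target.
    return target_velocity(start, target)


midpoint = interpolate(start, target, 0.5)
one_step = euler_sample(start, oracle_velocity, steps=1)
```

Since the velocity along a straight path is constant, a single Euler step of size 1 lands exactly on the target when the velocity is predicted perfectly; a learned model only approximates this, so quality then depends on how well the network fits the velocity field.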

* Accepted by ICASSP 2025
