Maris Hillemann

Real-Time Streamable Generative Speech Restoration with Flow Matching

Dec 22, 2025

Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann

Abstract: Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.

* This work has been submitted to the IEEE for possible publication

SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Sep 22, 2024

Christian Wilms, Tim Rolff, Maris Hillemann, Robert Johanson, Simone Frintrop

Abstract: We propose an approach for Open-World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of annotated object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state-of-the-art systems, which often generate background detections. To this end, we generate high-quality pseudo annotations based on the foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained by self-attention maps from self-supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post-processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on COCO, LVIS, and ADE20k datasets and improves on the precision by up to 81.6% compared to the state-of-the-art. Source code is available at: https://github.com/chwilms/SOS

* Accepted at ECCV 2024. Code available at https://github.com/chwilms/SOS
