Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tim Fingscheidt

OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data

Sep 18, 2025

Björn Möller, Zhengyang Li, Malte Stelzer, Thomas Graave, Fabian Bettels, Muaaz Ataya, Tim Fingscheidt

Abstract:Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are: Unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system by separate quantitative and qualitative evaluation: Image tokenizer, world model, video decoder. Second, we purely build upon powerful pre-trained open source models from various domains, which we fine-tune by publicly available automotive data (BDD100K) on GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining interfaces of our components. Fourth, due to public availability of the underlying models and data, we allow full reproducibility. Finally, we also publish our code and models on Github. For an image size of 256x256 at 4 fps we are able to predict realistic driving scene videos frame-by-frame with only one frame of algorithmic latency.

Via

Access Paper or Ask Questions

Interspeech 2025 URGENT Speech Enhancement Challenge

May 29, 2025

Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang(+2 more)

Abstract:There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

May 10, 2025

Patrick Blumenberg, Thomas Graave, Tim Fingscheidt

Figure 1 for Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Figure 2 for Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Figure 3 for Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Figure 4 for Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Abstract:Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.

Via

Access Paper or Ask Questions

A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only

Mar 30, 2025

Björn Möller, Lucas Görnhardt, Tim Fingscheidt

Abstract:Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, the availability of high-quality HR training images is not given, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only low-resolution (LR) images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: https://github.com/ifnspaml/SuperResolutionMultiscaleTraining.

Via

Access Paper or Ask Questions

Foundation Models for Amodal Video Instance Segmentation in Automated Driving

Sep 21, 2024

Jasmin Breitenstein, Franz Jünger, Andreas Bär, Tim Fingscheidt

Abstract:In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain and the resulting methods suffer from a trade-off between instance segmentation and tracking performance. To largely solve this issue, we propose to study the application of foundation models for this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM), while fine-tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on an amodal instance segmentation, we nevertheless obtain video-level amodal instance segmentation results. Our resulting S-AModal method achieves state-of-the-art results in amodal video instance segmentation while resolving the need for amodal video-based labels. Code for S-AModal is available at https://github.com/ifnspaml/S-AModal.

* accepted at ECCV VCAD Workshop 2024

Via

Access Paper or Ask Questions

Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocode

Aug 07, 2024

Renzheng Shi, Andreas Bär, Marvin Sach, Wouter Tirry, Tim Fingscheidt

Figure 1 for Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocode

Figure 2 for Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocode

Figure 3 for Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocode

Figure 4 for Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocode

Abstract:Recently, BigVGAN has emerged as high-performance speech vocoder. Its sequence-to-sequence-based synthesis, however, prohibits usage in low-latency conversational applications. Our work addresses this shortcoming in three steps. First, we introduce low latency into BigVGAN via implementing causal convolutions, yielding decreased performance. Second, to regain performance, we propose a teacher-student transfer learning scheme to distill the high-delay non-causal BigVGAN into our low-latency causal vocoder. Third, taking advantage of a self-supervised learning (SSL) model, in our case wav2vec 2.0, we align its encoder speech representations extracted from our low-latency causal vocoder to the ground truth ones. In speaker-independent settings, both proposed training schemes notably elevate the performance of our low-latency vocoder, closing up to the original high-delay BigVGAN. At only 21% higher complexity, our best small causal vocoder achieves 3.96 PESQ and 1.25 MCD, excelling even the original small non-causal BigVGAN (3.64 PESQ) by 0.32 PESQ and 0.1 MCD points, respectively.

* Accepted at IWAENC 2024

Via

Access Paper or Ask Questions

Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Jul 15, 2024

Danish Nazir, Timo Bartels, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt

Figure 1 for Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Figure 2 for Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Figure 3 for Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Figure 4 for Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding

Abstract:Distributed computing in the context of deep neural networks (DNNs) implies the execution of one part of the network on edge devices and the other part typically on a large-scale cloud platform. Conventional methods propose to employ a serial concatenation of a learned image and source encoder, the latter projecting the image encoder output (bottleneck features) into a quantized representation for bitrate-efficient transmission. In the cloud, a respective source decoder reprojects the quantized representation to the original feature representation, serving as an input for the downstream task decoder performing, e.g., semantic segmentation. In this work, we propose joint source and task decoding, as it allows for a smaller network size in the cloud. This further enables the scalability of such services in large numbers without requiring extensive computational load on the cloud per channel. We demonstrate the effectiveness of our method by achieving a distributed semantic segmentation SOTA over a wide range of bitrates on the mean intersection over union metric, while using only $9.8 \%$ ... $11.59 \%$ of cloud DNN parameters used in the previous SOTA on the COCO and Cityscapes datasets.

* accepted in ECCV2024

Via

Access Paper or Ask Questions

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Jun 07, 2024

Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe(+2 more)

Figure 1 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 2 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 3 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 4 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Abstract:The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.

* 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

Via

Access Paper or Ask Questions

Efficient High-Performance Bark-Scale Neural Network for Residual Echo and Noise Suppression

Apr 08, 2024

Ernst Seidel, Pejman Mowlaee, Tim Fingscheidt

Abstract:In recent years, the introduction of neural networks (NNs) into the field of speech enhancement has brought significant improvements. However, many of the proposed methods are quite demanding in terms of computational complexity and memory footprint. For the application in dedicated communication devices, such as speakerphones, hands-free car systems, or smartphones, efficiency plays a major role along with performance. In this context, we present an efficient, high-performance hybrid joint acoustic echo control and noise suppression system, whereby our main contribution is the postfilter NN, performing both noise and residual echo suppression. The preservation of nearend speech is improved by a Bark-scale auditory filterbank for the NN postfilter. The proposed hybrid method is benchmarked with state-of-the-art methods and its effectiveness is demonstrated on the ICASSP 2023 AEC Challenge blind test set. We demonstrate that it offers high-quality nearend speech preservation during both double-talk and nearend speech conditions. At the same time, it is capable of efficient removal of echo leaks, achieving a comparable performance to already small state-of-the-art models such as the end-to-end DeepVQE-S, while requiring only around 10 % of its computational complexity. This makes it easily realtime implementable on a speakerphone device.

* accepted to ICASSP 2024; 5 pages, 3 figures

Via

Access Paper or Ask Questions

Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation

Dec 04, 2023

Joshua Niemeijer, Manuel Schwonberg, Jan-Aike Termöhlen, Nico M. Schmidt, Tim Fingscheidt

Abstract:When models, e.g., for semantic segmentation, are applied to images that are vastly different from training data, the performance will drop significantly. Domain adaptation methods try to overcome this issue, but need samples from the target domain. However, this might not always be feasible for various reasons and therefore domain generalization methods are useful as they do not require any target data. We present a new diffusion-based domain extension (DIDEX) method and employ a diffusion model to generate a pseudo-target domain with diverse text prompts. In contrast to existing methods, this allows to control the style and content of the generated images and to introduce a high diversity. In a second step, we train a generalizing model by adapting towards this pseudo-target domain. We outperform previous approaches by a large margin across various datasets and architectures without using any real data. For the generalization from GTA5, we improve state-of-the-art mIoU performance by 3.8% absolute on average and for SYNTHIA by 11.8% absolute, marking a big step for the generalization performance on these benchmarks. Code is available at https://github.com/JNiemeijer/DIDEX

* Accepted to WACV 2024

Via

Access Paper or Ask Questions