Perceptual evaluation of speech quality (PESQ) requires a clean speech reference as input, but predicts the results of (reference-free) absolute category rating (ACR) tests. In this work, we train a fully convolutional recurrent neural network (FCRN) as a deep noise suppression (DNS) model, with either a non-intrusive or an intrusive PESQNet, where only the latter has access to a clean speech reference. The PESQNet is used as a mediator providing a perceptual loss during DNS training to maximize the PESQ score of the enhanced speech signal. For the intrusive PESQNet, we investigate two topologies, called early-fusion (EF) and middle-fusion (MF) PESQNet, and compare them to the non-intrusive PESQNet to evaluate and quantify the benefit of a clean speech reference input during DNS training. Detailed analyses show that the DNS trained with the MF-intrusive PESQNet outperforms the Interspeech 2021 DNS Challenge baseline and the same DNS trained with an MSE loss by 0.23 and 0.12 PESQ points, respectively. Furthermore, we show that only marginal benefits are obtained compared to the DNS trained with the non-intrusive PESQNet. Therefore, just as in ACR listening tests, the PESQNet does not necessarily require a clean speech reference input, opening the possibility of using real data for DNS training.
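The mediator idea can be sketched as a loss function in which a frozen PESQNet scores the enhanced signal; here `predict_pesq` stands in for the PESQNet forward pass, and the blending weight `alpha` is an illustrative choice, not a value from the paper:

```python
import numpy as np

def pesqnet_mediated_loss(enhanced, clean, predict_pesq, alpha=0.5):
    """Sketch of a PESQNet-mediated loss: the (frozen) PESQNet scores the
    enhanced signal, and the loss pushes that score towards the PESQ
    maximum of 4.5. `alpha` blends in a conventional MSE term."""
    mse = float(np.mean((enhanced - clean) ** 2))
    perceptual = (4.5 - predict_pesq(enhanced)) ** 2
    return alpha * mse + (1.0 - alpha) * perceptual
```

In the intrusive EF/MF variants, `predict_pesq` would additionally receive the clean reference as input; the non-intrusive variant scores the enhanced signal alone.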
Although today's speech communication systems support various bandwidths from narrowband to super-wideband and beyond, state-of-the-art DNN methods for acoustic echo cancellation (AEC) lack modularity and bandwidth scalability. Our proposed DNN model builds upon a fully convolutional recurrent network (FCRN) and introduces scalability over various bandwidths up to a fullband (FB) system (48 kHz sampling rate). This modular approach allows joint wideband (WB) pre-training of the mask-based AEC and postfilter stages with dedicated losses, followed by separate training of both on FB data. A third lightweight blind bandwidth extension stage is separately trained on FB data, flexibly allowing the WB postfilter output to be extended towards higher bandwidths until reaching FB. Thereby, higher-frequency noise and echo are reliably suppressed. On the ICASSP 2022 Acoustic Echo Cancellation Challenge blind test set we report competitive performance, showing robustness even under highly delayed echo and dynamic echo path changes.
While deep neural networks (DNNs) achieve impressive performance on environment perception tasks, their sensitivity to adversarial perturbations limits their use in practical applications. In this paper, we (i) propose a novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation). Specifically, adversarial perturbations are detected by inconsistencies between extracted edges of the input image, the depth output, and the segmentation output. To further improve this technique, we (ii) develop a novel edge consistency loss between all three modalities, thereby improving their initial consistency, which in turn supports our detection scheme. We verify our detection scheme's effectiveness by employing various known attacks and image noises. In addition, we (iii) develop a multi-task adversarial attack, aiming at fooling both tasks as well as our detection scheme. Experimental evaluation on the Cityscapes and KITTI datasets shows that, assuming a 5% false positive rate, up to 100% of images are correctly detected as adversarially perturbed, depending on the strength of the perturbation. Code will be available on GitHub. A short video at https://youtu.be/KKa6gOyWmH4 provides qualitative results.
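The detection principle, disagreement between the edges of the input image, the depth output, and the segmentation output, can be sketched with Sobel edge maps and a pairwise-correlation score; the threshold `tau` and the correlation-based score are illustrative assumptions, not the paper's exact detector:

```python
import numpy as np

SOBEL_X = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)

def conv2d_valid(img, kernel):
    # naive 'valid' 2-D convolution, sufficient for a sketch
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def edge_map(x):
    # Sobel gradient magnitude as a simple edge extractor
    gx = conv2d_valid(x, SOBEL_X)
    gy = conv2d_valid(x, SOBEL_X.T)
    return np.hypot(gx, gy)

def edge_consistency(image, depth, seg):
    # mean pairwise correlation of the three modalities' edge maps
    e = [edge_map(m).ravel() for m in (image, depth, seg)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    return float(np.mean([np.corrcoef(e[a], e[b])[0, 1] for a, b in pairs]))

def is_adversarial(image, depth, seg, tau=0.5):
    # flag the input as perturbed if the modalities' edges disagree
    return edge_consistency(image, depth, seg) < tau
```

On clean inputs the three edge maps largely coincide, so the score stays high; a perturbation that distorts one task's output while leaving the image edges intact lowers the score below the threshold.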
Environment perception in autonomous driving vehicles often heavily relies on deep neural networks (DNNs), which are subject to domain shifts, leading to a significantly decreased performance during DNN deployment. Usually, this problem is addressed by unsupervised domain adaptation (UDA) approaches trained either simultaneously on source and target domain datasets or even source-free only on target data in an offline fashion. In this work, we further expand a source-free UDA approach to a continual and therefore online-capable UDA on a single-image basis for semantic segmentation. Accordingly, our method only requires the pre-trained model from the supplier (trained in the source domain) and the current (unlabeled target domain) camera image. Our method Continual BatchNorm Adaptation (CBNA) modifies the source domain statistics in the batch normalization layers, using target domain images in an unsupervised fashion, which yields consistent performance improvements during inference. Thereby, in contrast to existing works, our approach can be applied to improve a DNN continuously on a single-image basis during deployment without access to source data, without algorithmic delay, and nearly without computational overhead. We show the consistent effectiveness of our method across a wide variety of source/target domain settings for semantic segmentation. As part of this work, our code will be made publicly available.
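The core mechanism, re-estimating batch normalization statistics from the single current target-domain image, can be sketched as follows; the mixing factor `alpha` and the simple convex combination of source and image statistics are illustrative assumptions, not the paper's exact re-estimation rule:

```python
import numpy as np

def cbna_batchnorm(x, run_mean, run_var, gamma, beta, alpha=0.1, eps=1e-5):
    """Normalize a single image's feature map x of shape (C, H, W) with
    batch-norm statistics adapted towards the current target-domain image.
    run_mean/run_var are the source-domain statistics stored in the layer;
    gamma/beta are the learned affine parameters (all of shape (C,))."""
    img_mean = x.mean(axis=(1, 2))
    img_var = x.var(axis=(1, 2))
    mean = (1.0 - alpha) * run_mean + alpha * img_mean   # adapted statistics
    var = (1.0 - alpha) * run_var + alpha * img_var
    x_hat = (x - mean[:, None, None]) / np.sqrt(var[:, None, None] + eps)
    return gamma[:, None, None] * x_hat + beta[:, None, None]
```

Applied to every batch normalization layer at inference time, this touches only stored statistics, which is why no source data, no labels, and essentially no extra computation are needed during deployment.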
In recent years, semantic segmentation has benefited from various advances in computer vision. Inspired by the versatile CycleGAN architecture, we combine semantic segmentation with the concept of cycle consistency to enable a multi-task training protocol. However, learning is largely prevented by the so-called steganography effect, which manifests itself as watermarks in the latent segmentation domain, making image reconstruction too easy a task. To combat this, we propose noise injection, based either on quantization noise or on additive Gaussian noise, to avoid this disadvantageous information flow in the cycle architecture. We find that noise injection significantly reduces the generation of watermarks and thus allows the recognition of highly relevant classes such as "traffic signs", which are hardly detected by the ERFNet baseline. We report mIoU and PSNR results on the Cityscapes dataset for semantic segmentation and image reconstruction, respectively. The proposed methodology achieves an mIoU improvement on the Cityscapes validation set of 5.7% absolute over the same CycleGAN without noise injection, and still 4.9% absolute over the non-cyclic ERFNet baseline.
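The two noise-injection variants can be sketched as follows; the number of quantization `levels` and the Gaussian `sigma` are illustrative hyperparameters, and the latent segmentation values are assumed to lie in [0, 1]:

```python
import numpy as np

def inject_noise(latent, mode="quantization", levels=16, sigma=0.1, rng=None):
    """Destroy low-amplitude 'watermark' information hidden in the latent
    segmentation domain before it reaches the reconstruction network."""
    if mode == "gaussian":
        rng = rng or np.random.default_rng()
        return latent + rng.normal(0.0, sigma, size=latent.shape)
    # quantization noise: snap each value to a uniform grid in [0, 1]
    return np.round(latent * (levels - 1)) / (levels - 1)
```

In actual training, the rounding in the quantization variant is non-differentiable, so gradients would have to pass through it with something like a straight-through estimator; the Gaussian variant is differentiable as-is.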
Reconfigurable intelligent surface (RIS) is an emerging technology for future wireless communication systems. In this work, we consider downlink spatial multiplexing enabled by the RIS for weighted sum-rate (WSR) maximization. In the literature, most solutions use alternating gradient-based optimization, which has moderate performance, high complexity, and limited scalability. We propose to apply a fully convolutional network (FCN), originally designed for semantic segmentation of images, to solve this problem. The rectangular shape of the RIS and the spatial correlation of channels at adjacent RIS antennas, due to the short distance between them, encourage us to apply the FCN to the RIS configuration. We design a set of channel features that includes both the cascaded channels via the RIS and the direct channel. At the base station (BS), the differentiable minimum mean squared error (MMSE) precoder is used for pre-training, and the weighted minimum mean squared error (WMMSE) precoder, which is non-differentiable and more complex but achieves better performance, is then applied for fine-tuning. Evaluation results show that the proposed solution achieves higher performance and allows for faster evaluation than the baselines. Hence, it scales better to a large number of antennas, advancing the RIS one step closer to practical deployment.
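The image-like input construction can be sketched as follows: each RIS element maps to one pixel of the rectangular grid, and real/imaginary parts of the cascaded and direct channels are stacked as input planes for the FCN. The exact feature set and normalization used in the paper may differ; this is only the structural idea:

```python
import numpy as np

def build_fcn_input(h_cascaded, h_direct, rows, cols):
    """h_cascaded: complex cascaded channel per RIS element, shape (rows*cols,).
    h_direct: complex direct BS-user channel (a scalar here, broadcast as a plane).
    Returns a (channels, rows, cols) real-valued tensor for the FCN."""
    grid = h_cascaded.reshape(rows, cols)          # RIS elements as pixels
    planes = [grid.real, grid.imag,
              np.full((rows, cols), h_direct.real),
              np.full((rows, cols), h_direct.imag)]
    return np.stack(planes, axis=0)
```

Because adjacent pixels correspond to physically adjacent RIS elements with correlated channels, the convolutional filters of the FCN can exploit this local structure, just as they exploit spatial correlation in natural images.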
Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). DNS methods are typically trained with mean squared error (MSE)-type loss functions, which do not guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable and therefore cannot be used directly as an optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQNet DNN to estimate the PESQ score of the enhanced speech signal. By providing a reference-free perceptual loss, it serves as a mediator for the DNS training, allowing the PESQ score of the enhanced speech signal to be maximized. We illustrate the potential of our proposed PESQNet-mediated training on the basis of an already strong baseline DNS. As a further novelty, we propose to train the DNS and the PESQNet alternatingly, to keep the PESQNet up-to-date and performing well specifically for the DNS under training. Our proposed method is compared to the same DNS trained with an MSE-based loss for joint denoising and dereverberation, and to the Interspeech 2021 DNS Challenge baseline. Detailed analysis shows that the PESQNet mediation can further increase DNS performance by about 0.1 PESQ points on synthetic test data and by 0.03 DNSMOS points on real test data, compared to training with the MSE-based loss. Our proposed method also outperforms the Challenge baseline by 0.2 PESQ points on synthetic test data and 0.1 DNSMOS points on real test data.
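The alternating protocol can be sketched as follows; `dns_step` and `pesqnet_step` stand in for one optimizer update of the respective network (with the other one frozen), and the simple per-batch toggle is an assumption about the schedule, not necessarily the paper's exact epoch structure:

```python
def alternating_training(batches, dns_step, pesqnet_step):
    """Alternate between (a) updating the DNS with the frozen PESQNet
    providing the perceptual loss, and (b) updating the PESQNet on the
    current DNS outputs so its PESQ estimates stay up-to-date."""
    for i, batch in enumerate(batches):
        if i % 2 == 0:
            dns_step(batch)       # PESQNet frozen, supplies the loss
        else:
            pesqnet_step(batch)   # DNS frozen, PESQNet tracks its outputs
```

Without step (b), the DNS would drift away from the data distribution the PESQNet was trained on, and the perceptual loss estimates would degrade.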
Scaling the distribution of automated vehicles requires handling various unexpected and possibly dangerous situations, termed corner cases (CC). Since many modules of automated driving systems are based on machine learning (ML), CC are an essential part of the data for their development. However, large-scale data collections contain only a limited amount of CC data, which makes CC challenging in the context of ML. A better understanding of CC can benefit both offline applications, e.g., dataset analysis, and online methods, e.g., the performance of automated driving systems. While knowledge-based descriptions and taxonomies for CC exist, there is little research on machine-interpretable descriptions. In this extended abstract, we give a brief overview of the challenges and goals of such a description.