Speaker identification in the household scenario (e.g., for smart speakers) is typically based on only a few enrollment utterances but a much larger set of unlabeled data, suggesting semisupervised learning to improve speaker profiles. We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario, to leverage the unlabeled speech samples. In contrast to most of the works in speaker recognition that focus on speaker-discriminative embeddings, this work focuses on speaker label inference (scoring). Given a pre-trained embedding extractor, graph-based learning allows us to integrate information about both labeled and unlabeled utterances. Considering each utterance as a graph node, we represent pairwise utterance similarity scores as edge weights. Graphs are constructed per household, and speaker identities are propagated to unlabeled nodes to optimize a global consistency criterion. We show in experiments on the VoxCeleb dataset that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods as well as their semi-supervised variants based on pseudo-labels.
The L3DAS21 Challenge is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing, with particular focus on 3D speech enhancement (SE) and 3D sound localization and detection (SELD). Alongside with the challenge, we release the L3DAS21 dataset, a 65 hours 3D audio corpus, accompanied with a Python API that facilitates the data usage and results submission stage. Usually, machine learning approaches to 3D audio tasks are based on single-perspective Ambisonics recordings or on arrays of single-capsule microphones. We propose, instead, a novel multichannel audio configuration based multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones. To the best of our knowledge, it is the first time that a dual-mic Ambisonics configuration is used for these tasks. We provide baseline models and results for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and SELDNet for SELD. This report is aimed at providing all needed information to participate in the L3DAS21 Challenge, illustrating the details of the L3DAS21 dataset, the challenge tasks and the baseline models.
Online Transformer-based automatic speech recognition (ASR) systems have been extensively studied due to the increasing demand for streaming applications. Recently proposed Decoder-end Adaptive Computation Steps (DACS) algorithm for online Transformer ASR was shown to achieve state-of-the-art performance and outperform other existing methods. However, like any other online approach, the DACS-based attention heads in each of the Transformer decoder layers operate independently (or asynchronously) and lead to diverged attending positions. Since DACS employs a truncation threshold to determine the halting position, some of the attention weights are cut off untimely and might impact the stability and precision of decoding. To overcome these issues, here we propose a head-synchronous (HS) version of the DACS algorithm, where the boundary of attention is jointly detected by all the DACS heads in each decoder layer. ASR experiments on Wall Street Journal (WSJ), AIShell-1 and Librispeech show that the proposed method consistently outperforms vanilla DACS and achieves state-of-the-art performance. We will also demonstrate that HS-DACS has reduced decoding cost when compared to vanilla DACS.
As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice about the current landscape of such sampling methods is the lack of a systematic comparison and some myths about preferring one over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to some common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups.
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing methods do not scale well to large scale distributed systems (due to gradient build-up) and/or fail to evaluate model fidelity (test accuracy) on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient Compression (ScaleCom), that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability. Using theoretical analysis, we show that ScaleCom provides favorable convergence guarantees and is compatible with gradient all-reduce techniques. Furthermore, we experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) across a wide range of applications (image, language, and speech) without significant accuracy loss.
Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been well applied in non-autoregressive (NAR) speech recognition and achieved competitive performance compared with other NAR methods. However, such an alignment learning strategy may also result in inaccurate acoustic boundary estimation and deceleration in convergence speed. To eliminate these drawbacks and improve performance further, we incorporate an additional connectionist temporal classification (CTC) based alignment loss and a contextual decoder into the CIF-based NAR model. Specifically, we use the CTC spike information to guide the leaning of acoustic boundary and adopt a new contextual decoder to capture the linguistic dependencies within a sentence in the conventional CIF model. Besides, a recently proposed Conformer architecture is also employed to model both local and global acoustic dependencies. Experiments on the open-source Mandarin corpora AISHELL-1 show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 latency compared with a state-of-the-art autoregressive (AR) Conformer model.
This paper presents a new hybrid architecture for voice activity detection (VAD) incorporating both convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) layers trained in an end-to-end manner. In addition, we focus specifically on optimising the computational efficiency of our architecture in order to deliver robust performance in difficult in-the-wild noise conditions in a severely under-resourced setting. Nested k-fold cross-validation was used to explore the hyperparameter space, and the trade-off between optimal parameters and model size is discussed. The performance effect of a BiLSTM layer compared to a unidirectional LSTM layer was also considered. We compare our systems with three established baselines on the AVA-Speech dataset. We find that significantly smaller models with near optimal parameters perform on par with larger models trained with optimal parameters. BiLSTM layers were shown to improve accuracy over unidirectional layers by $\approx$2% absolute on average. With an area under the curve (AUC) of 0.951, our system outperforms all baselines, including a much larger ResNet system, particularly in difficult noise conditions.
Amplitude demodulation is a classical operation used in signal processing. For a long time, its effective applications in practice have been limited to narrowband signals. In this work, we generalize amplitude demodulation to wideband signals. We pose demodulation as a recovery problem of an oversampled corrupted signal and introduce special iterative schemes belonging to the family of alternating projection algorithms to solve it. Sensibly chosen structural assumptions on the demodulation outputs allow us to reveal the high inferential accuracy of the method over a rich set of relevant signals. This new approach surpasses current state-of-the-art demodulation techniques apt to wideband signals in computational efficiency by up to many orders of magnitude with no sacrifice in quality. Such performance opens the door for applications of the amplitude demodulation procedure in new contexts. In particular, the new method makes online and large-scale offline data processing feasible, including the calculation of modulator-carrier pairs in higher dimensions and poor sampling conditions, independent of the signal bandwidth. We illustrate the utility and specifics of applications of the new method in practice by using synthetic and natural speech signals.