"speech": models, code, and papers

Adapting Speaker Embeddings for Speaker Diarisation

Apr 07, 2021
Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques to better adapt speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in diarisation error rate over the baseline.

* 5 pages, 2 figures, 3 tables, submitted to Interspeech as a conference paper 
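Of the three techniques, attention-based embedding aggregation is the easiest to sketch in isolation. The snippet below is a minimal illustrative version in NumPy: frame-level embeddings are pooled into a single segment-level embedding by a softmax over learned attention scores. The weight vector `w` and all dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

def attentive_aggregate(frame_embeddings, w):
    """Pool frame-level embeddings into one segment embedding,
    weighting frames by a softmax over attention scores."""
    scores = frame_embeddings @ w            # one score per frame, shape (T,)
    scores -= scores.max()                   # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()                     # softmax over frames
    return alpha @ frame_embeddings          # weighted average, shape (D,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 256))  # 50 frames of 256-dim embeddings
w = rng.normal(size=256)             # stand-in for a learned weight vector
segment_embedding = attentive_aggregate(frames, w)
```

In a trained system, `w` (or a small MLP in its place) would be learned jointly with the embedding extractor, so that informative frames receive higher weight than silence or noise.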


Multi-Decoder DPRNN: High Accuracy Source Counting and Separation

Nov 30, 2020
Junzhe Zhu, Raymond Yeh, Mark Hasegawa-Johnson

We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers, clarifying how to assess quality when the ground truth has more or fewer speakers than the model predicts. We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.

* Project Page: Submitted to ICASSP 2021 


The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge

Oct 22, 2020
Renyu Wang, Ruilin Tong, Yu Ting Yeung, Xiao Chen

This paper describes the development of our system for the VoxCeleb Speaker Diarisation Challenge 2020. A well-trained neural network based speech enhancement model is used for pre-processing, followed by a neural network based voice activity detection (VAD) system that removes background music and noise, which are harmful to speaker diarisation. The diarisation system itself is built on agglomerative hierarchical clustering (AHC) of x-vectors and a variational Bayesian hidden Markov model (VB-HMM) based iterative clustering. Experimental results demonstrate that the proposed system yields substantial improvements over the baseline method for the diarisation task of the VoxCeleb Speaker Recognition Challenge 2020.

* 5 pages, 2 figures, a report on our diarisation system for the VoxCeleb Challenge, Interspeech workshop 
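The first clustering stage, AHC over x-vectors, can be sketched directly. Below is an illustrative pure-NumPy version using average linkage over cosine distance; the stopping threshold and the toy "x-vectors" are invented for the example, and the actual challenge pipeline (including the VB-HMM refinement) is considerably more involved.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two embedding vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ahc(xvectors, threshold):
    """Naive average-linkage agglomerative clustering: repeatedly merge
    the closest pair of clusters until no pair is within the threshold."""
    clusters = [[i] for i in range(len(xvectors))]

    def avg_dist(c1, c2):
        return np.mean([cosine_dist(xvectors[a], xvectors[b])
                        for a in c1 for b in c2])

    while len(clusters) > 1:
        d, i, j = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break                       # closest pair too far apart: stop
        clusters[i].extend(clusters[j])
        del clusters[j]

    labels = np.empty(len(xvectors), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy data: two "speakers", each a base direction plus small noise.
rng = np.random.default_rng(0)
spk1, spk2 = rng.normal(size=64), rng.normal(size=64)
segments = np.vstack([spk1 + 0.1 * rng.normal(size=(10, 64)),
                      spk2 + 0.1 * rng.normal(size=(10, 64))])
labels = ahc(segments, threshold=0.5)
```

In the toy data, segments from the same "speaker" share a base direction, so the clustering recovers the two groups; a real system would tune the stopping threshold on held-out data.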


Linguists Who Use Probabilistic Models Love Them: Quantification in Functional Distributional Semantics

Jun 04, 2020
Guy Emerson

Functional Distributional Semantics provides a computationally tractable framework for learning truth-conditional semantics from a corpus. Previous work in this framework has provided a probabilistic version of first-order logic, recasting quantification as Bayesian inference. In this paper, I show how the previous formulation gives trivial truth values when a precise quantifier is used with vague predicates. I propose an improved account, avoiding this problem by treating a vague predicate as a distribution over precise predicates. I connect this account to recent work in the Rational Speech Acts framework on modelling generic quantification, and I extend this to modelling donkey sentences. Finally, I explain how the generic quantifier can be both pragmatically complex and yet computationally simpler than precise quantifiers.

* To be published in Proceedings of Probability and Meaning 2020 


Surprisal-Triggered Conditional Computation with Neural Networks

Jun 02, 2020
Loren Lugosch, Derek Nowrouzezahrai, Brett H. Meyer

Autoregressive neural network models have been used successfully for sequence generation, feature extraction, and hypothesis scoring. This paper presents yet another use for these models: allocating more computation to more difficult inputs. In our model, an autoregressive model is used both to extract features and to predict observations in a stream of input observations. The surprisal of the input, measured as the negative log-likelihood of the current observation according to the autoregressive model, is used as a measure of input difficulty. This in turn determines whether a small, fast network or a big, slow network is used. Experiments on two speech recognition tasks show that our model can match the performance of a baseline that always uses the big network, while using 15% fewer FLOPs.
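The routing rule is simple enough to state directly. The sketch below is schematic: it assumes scalar observation probabilities supplied by the autoregressive model, and the threshold value is hypothetical, not from the paper.

```python
import math

def surprisal(prob):
    """Surprisal of an observation: the negative log-likelihood
    assigned to it by the autoregressive model."""
    return -math.log(prob)

def choose_network(prob, threshold=2.0):
    """Route the input to the big network only when the autoregressive
    model finds it surprising (threshold is an illustrative value)."""
    return "big" if surprisal(prob) > threshold else "small"

print(choose_network(0.9))   # well-predicted input  -> "small"
print(choose_network(0.01))  # surprising input      -> "big"
```

Raising the threshold trades accuracy for compute: more inputs are handled by the small network, at the risk of sending genuinely difficult frames to it as well.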


Study of Deep Generative Models for Inorganic Chemical Compositions

Oct 25, 2019
Yoshihide Sawada, Koji Morikawa, Mikiya Fujii

Generative models based on generative adversarial networks (GANs) and variational autoencoders (VAEs) have been widely studied in the fields of image generation, speech generation, and drug discovery, but only a few studies have focused on the generation of inorganic materials. Such studies use the crystal structures of materials, but material researchers rarely store this information. Thus, we generate chemical compositions without using crystal information. We use a conditional VAE (CondVAE) and a conditional GAN (CondGAN) and show that CondGAN using the bag-of-atom representation with physical descriptors generates better compositions than the other generative models. We also evaluate the effectiveness of the Metropolis-Hastings-based atomic valency modification and the extrapolation performance, which is important for materials discovery.

* 10 pages 


Communication-based Evaluation for Natural Language Generation

Oct 11, 2019
Benjamin Newman, Reuben Cohn-Gordon, Christopher Potts

Natural language generation (NLG) systems are commonly evaluated using n-gram overlap measures (e.g. BLEU, ROUGE). These measures do not directly capture semantics or speaker intentions, and so they often turn out to be misaligned with our true goals for NLG. In this work, we argue instead for communication-based evaluations: assuming the purpose of an NLG system is to convey information to a reader/listener, we can directly evaluate its effectiveness at this task using the Rational Speech Acts model of pragmatic language use. We illustrate with a color reference dataset that contains descriptions in pre-defined quality categories, showing that our method better aligns with these quality categories than do any of the prominent n-gram overlap methods.

* 11 pages, 2 figures, SCiL, camera-ready - clarified certain points, updated acknowledgements 


Robust Audio Adversarial Example for a Physical Attack

Oct 28, 2018
Hiromu Yakura, Jun Sakuma

The success of deep learning in recent years has raised concerns about adversarial examples, which allow attackers to force deep neural networks to output a specified target. Although a method has been proposed for generating audio adversarial examples targeting a state-of-the-art speech recognition model, it cannot fool the model when the audio is played over the air, and thus the threat was considered limited. In this paper, we propose a method to generate adversarial examples that remain effective even when played over the air in the physical world, by simulating the transformations caused by playback and recording and incorporating them into the generation process. An evaluation and a listening experiment demonstrated that audio adversarial examples generated by the proposed method may become a real threat.
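The core idea, simulating playback and recording during generation, can be illustrated with a toy transformation. The snippet below convolves a waveform with a decaying impulse response (a crude stand-in for room reverberation) and adds microphone-like noise; the impulse response, noise level, and sample rate are all invented for the example, and the paper's simulation is more elaborate.

```python
import numpy as np

def simulate_over_the_air(x, impulse_response, noise_std, rng):
    """Rough stand-in for playback/recording effects: reverberation via
    convolution with a room impulse response, plus microphone noise."""
    reverbed = np.convolve(x, impulse_response)[: len(x)]  # keep original length
    return reverbed + rng.normal(scale=noise_std, size=len(x))

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)           # one second of audio at 16 kHz
rir = np.exp(-np.arange(200) / 50.0)     # toy exponentially decaying impulse response
recorded = simulate_over_the_air(clean, rir, noise_std=0.01, rng=rng)
```

An attack along these lines would optimize the adversarial perturbation in expectation over many random draws of such transformations, so that the example survives whatever the physical channel actually does.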


Discrete Structural Planning for Neural Machine Translation

Aug 14, 2018
Raphael Shu, Hideki Nakayama

Structural planning is important for producing long sentences but is missing from current language generation models. In this work, we add a planning phase to neural machine translation to control the coarse structure of output sentences. The model first generates planner codes, then predicts the actual output words conditioned on them. The codes are learned to capture the coarse structure of the target sentence. To obtain the codes, we design an end-to-end neural network with a discretization bottleneck, which predicts the simplified part-of-speech tags of target sentences. Experiments show that translation performance is generally improved by planning ahead. We also find that translations with different structures can be obtained by manipulating the planner codes.
