Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team

Feb 23, 2020
Qingjian Lin, Weicheng Cai, Lin Yang, Junjie Wang, Jun Zhang, Ming Li

In this paper, we present the submitted system for the second DIHARD Speech Diarization Challenge from the DKULENOVO team. Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection. For each module, we explore different techniques to enhance performance. Our final submission employs the ResNet-LSTM based VAD, the Deep ResNet based speaker embedding, the LSTM based similarity scoring and spectral clustering. Variational Bayes (VB) diarization is applied in the resegmentation stage and overlap detection also brings slight improvement. Our proposed system achieves 18.84% DER in Track1 and 27.90% DER in Track2. Although our systems have reduced the DERs by 27.5% and 31.7% relatively against the official baselines, we believe that the diarization task is still very difficult.

* Submitted to Odyssesy 2020 

  Access Paper or Ask Questions

Lost in Interpretation: Predicting Untranslated Terminology in Simultaneous Interpretation

Apr 01, 2019
Nikolai Vogler, Craig Stewart, Graham Neubig

Simultaneous interpretation, the translation of speech from one language to another in real-time, is an inherently difficult and strenuous task. One of the greatest challenges faced by interpreters is the accurate translation of difficult terminology like proper names, numbers, or other entities. Intelligent computer-assisted interpreting (CAI) tools that could analyze the spoken word and detect terms likely to be untranslated by an interpreter could reduce translation error and improve interpreter performance. In this paper, we propose a task of predicting which terminology simultaneous interpreters will leave untranslated, and examine methods that perform this task using supervised sequence taggers. We describe a number of task-specific features explicitly designed to indicate when an interpreter may struggle with translating a word. Experimental results on a newly-annotated version of the NAIST Simultaneous Translation Corpus (Shimizu et al., 2014) indicate the promise of our proposed method.

* NAACL 2019 

  Access Paper or Ask Questions

Improving Rotated Text Detection with Rotation Region Proposal Networks

Nov 16, 2018
Jing Huang, Viswanath Sivakumar, Mher Mnatsakanyan, Guan Pang

A significant number of images shared on social media platforms such as Facebook and Instagram contain text in various forms. It's increasingly becoming commonplace for bad actors to share misinformation, hate speech or other kinds of harmful content as text overlaid on images on such platforms. A scene-text understanding system should hence be able to handle text in various orientations that the adversary might use. Moreover, such a system can be incorporated into screen readers used to aid the visually impaired. In this work, we extend the scene-text extraction system at Facebook, Rosetta, to efficiently handle text in various orientations. Specifically, we incorporate the Rotation Region Proposal Networks (RRPN) in our text extraction pipeline and offer practical suggestions for building and deploying a model for detecting and recognizing text in arbitrary orientations efficiently. Experimental results show a significant improvement on detecting rotated text.

  Access Paper or Ask Questions

Text-Independent Speaker Verification Using Long Short-Term Memory Networks

Sep 07, 2018
Aryan Mobiny, Mohammad Najarian

In this paper, an architecture based on Long Short-Term Memory Networks has been proposed for the text-independent scenario which is aimed to capture the temporal speaker-related information by operating over traditional speech features. For speaker verification, at first, a background model must be created for speaker representation. Then, in enrollment stage, the speaker models will be created based on the enrollment utterances. For this work, the model will be trained in an end-to-end fashion to combine the first two stages. The main goal of end-to-end training is the model being optimized to be consistent with the speaker verification protocol. The end- to-end training jointly learns the background and speaker models by creating the representation space. The LSTM architecture is trained to create a discrimination space for validating the match and non-match pairs for speaker verification. The proposed architecture demonstrate its superiority in the text-independent compared to other traditional methods.

  Access Paper or Ask Questions

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

Apr 12, 2018
Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling

We present the Voice Conversion Challenge 2018, designed as a follow up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems. The objective of the challenge was to perform speaker conversion (i.e. transform the vocal identity) of a source speaker to a target speaker while maintaining linguistic information. As an update to the previous challenge, we considered both parallel and non-parallel data to form the Hub and Spoke tasks, respectively. A total of 23 teams from around the world submitted their systems, 11 of them additionally participated in the optional Spoke task. A large-scale crowdsourced perceptual evaluation was then carried out to rate the submitted converted speech in terms of naturalness and similarity to the target speaker identity. In this paper, we present a brief summary of the state-of-the-art techniques for VC, followed by a detailed explanation of the challenge tasks and the results that were obtained.

* Accepted for Speaker Odyssey 2018 

  Access Paper or Ask Questions

Sarcasm SIGN: Interpreting Sarcasm with Sentiment Based Monolingual Machine Translation

Apr 22, 2017
Lotem Peled, Roi Reichart

Sarcasm is a form of speech in which speakers say the opposite of what they truly mean in order to convey a strong sentiment. In other words, "Sarcasm is the giant chasm between what I say, and the person who doesn't get it.". In this paper we present the novel task of sarcasm interpretation, defined as the generation of a non-sarcastic utterance conveying the same message as the original sarcastic one. We introduce a novel dataset of 3000 sarcastic tweets, each interpreted by five human judges. Addressing the task as monolingual machine translation (MT), we experiment with MT algorithms and evaluation measures. We then present SIGN: an MT based sarcasm interpretation algorithm that targets sentiment words, a defining element of textual sarcasm. We show that while the scores of n-gram based automatic measures are similar for all interpretation models, SIGN's interpretations are scored higher by humans for adequacy and sentiment polarity. We conclude with a discussion on future research directions for our new task.

  Access Paper or Ask Questions

Variable Computation in Recurrent Neural Networks

Mar 02, 2017
Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov

Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long range dependency or localized attention phenomena. However, while many sequential data (such as video, speech or language) can have highly variable information flow, most recurrent models still consume input features at a constant rate and perform a constant number of computations per time step, which can be detrimental to both speed and model capacity. In this paper, we explore a modification to existing recurrent units which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure. We show experimentally that not only do our models require fewer operations, they also lead to better performance overall on evaluation tasks.

  Access Paper or Ask Questions

Training an Interactive Humanoid Robot Using Multimodal Deep Reinforcement Learning

Nov 26, 2016
Heriberto Cuayáhuitl, Guillaume Couly, Clément Olalainty

Training robots to perceive, act and communicate using multiple modalities still represents a challenging problem, particularly if robots are expected to learn efficiently from small sets of example interactions. We describe a learning approach as a step in this direction, where we teach a humanoid robot how to play the game of noughts and crosses. Given that multiple multimodal skills can be trained to play this game, we focus our attention to training the robot to perceive the game, and to interact in this game. Our multimodal deep reinforcement learning agent perceives multimodal features and exhibits verbal and non-verbal actions while playing. Experimental results using simulations show that the robot can learn to win or draw up to 98% of the games. A pilot test of the proposed multimodal system for the targeted game---integrating speech, vision and gestures---reports that reasonable and fluent interactions can be achieved using the proposed approach.

* NIPS Workshop on Future of Interactive Learning Machines, 2016 

  Access Paper or Ask Questions

Trainable Frontend For Robust and Far-Field Keyword Spotting

Jul 19, 2016
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous

Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded noisy and far-field eval sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.

  Access Paper or Ask Questions

PCA/LDA Approach for Text-Independent Speaker Recognition

Feb 25, 2016
Zhenhao Ge, Sudhendu R. Sharma, Mark J. T. Smith

Various algorithms for text-independent speaker recognition have been developed through the decades, aiming to improve both accuracy and efficiency. This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results. First, the performance based on only PCA and only LDA is measured; then a mixed model, taking advantages of both methods, is introduced. A subset of the TIMIT corpus composed of 200 male speakers, is used for enrollment, validation and testing. The best results achieve 100%; 96% and 95% classification rate at population level 50; 100 and 200, using 39-dimensional MFCC features with delta and double delta. These results are based on 12-second text-independent speech for training and 4-second data for test. These are comparable to the conventional MFCC-GMM methods, but require significantly less time to train and operate.

* Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series 

  Access Paper or Ask Questions