
"speech": models, code, and papers

L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing

Apr 12, 2021
Eric Guizzo, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia Medaglia, Giuseppe Nachira, Leonardo Nucciarelli, Ludovica Paglialunga, Marco Pennese, Sveva Pepe, Enrico Rocchi, Aurelio Uncini, Danilo Comminiello

The L3DAS21 Challenge is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing, with particular focus on 3D speech enhancement (SE) and 3D sound localization and detection (SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results submission stage. Machine learning approaches to 3D audio tasks are usually based on single-perspective Ambisonics recordings or on arrays of single-capsule microphones. We propose, instead, a novel multichannel audio configuration based on multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones. To the best of our knowledge, this is the first time that a dual-mic Ambisonics configuration has been used for these tasks. We provide baseline models and results for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and SELDNet for SELD. This report provides all the information needed to participate in the L3DAS21 Challenge, illustrating the details of the L3DAS21 dataset, the challenge tasks and the baseline models.
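
As a rough illustration of the dual first-order Ambisonics setup described above, the sketch below loads two 4-channel B-format recordings (one per microphone) and stacks them into a single 8-channel array. The file names, and the assumption that each microphone is stored as a separate 4-channel WAV, are hypothetical; the official Python API on the challenge site should be preferred for actual data handling.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

# Hypothetical file names: one 4-channel B-format (W, X, Y, Z) WAV per
# first-order Ambisonics microphone in the two-mic array.
sig_a, sr = sf.read("L3DAS21_example_micA.wav")   # shape: (num_samples, 4)
sig_b, _ = sf.read("L3DAS21_example_micB.wav")    # shape: (num_samples, 4)

# Stack the two perspectives into one 8-channel input for an SE/SELD model.
multi_view = np.concatenate([sig_a, sig_b], axis=1)  # (num_samples, 8)
print(multi_view.shape, sr)
```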

* Documentation paper for the L3DAS21 Challenge for IEEE MLSP 2021. Further information at www.l3das.com/mlsp2021 

Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR

Apr 10, 2021
Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, Shilei Zhang

Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been successfully applied to non-autoregressive (NAR) speech recognition and achieve competitive performance compared with other NAR methods. However, such an alignment learning strategy may also result in inaccurate acoustic boundary estimation and slower convergence. To eliminate these drawbacks and further improve performance, we incorporate an additional connectionist temporal classification (CTC) based alignment loss and a contextual decoder into the CIF-based NAR model. Specifically, we use the CTC spike information to guide the learning of acoustic boundaries, and adopt a new contextual decoder on top of the conventional CIF model to capture the linguistic dependencies within a sentence. In addition, the recently proposed Conformer architecture is employed to model both local and global acoustic dependencies. Experiments on the open-source Mandarin corpus AISHELL-1 show that the proposed method achieves a character error rate (CER) of 4.9%, comparable to a state-of-the-art autoregressive (AR) Conformer model, with only 1/24 of its latency.
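
To make the auxiliary-loss idea concrete, here is a minimal sketch of combining a CIF decoder's cross-entropy loss with a CTC alignment loss computed from the shared encoder output. The hidden size, vocabulary size, and weighting factor are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size, blank_id = 4233, 256, 0        # illustrative values
ctc_head = nn.Linear(hidden_size, vocab_size)           # on top of the shared encoder
ctc_loss_fn = nn.CTCLoss(blank=blank_id, zero_infinity=True)

def total_loss(enc_out, enc_lens, cif_ce_loss, targets, target_lens, lam=0.5):
    """Combine the CIF decoder's cross-entropy loss with a CTC alignment loss.

    enc_out:     (batch, time, hidden_size) shared encoder output
    enc_lens:    (batch,) valid encoder lengths
    cif_ce_loss: scalar loss already computed by the CIF decoder
    targets:     (batch, max_target_len) padded label sequences
    target_lens: (batch,) label lengths
    """
    log_probs = F.log_softmax(ctc_head(enc_out), dim=-1)      # (B, T, V)
    ctc = ctc_loss_fn(log_probs.transpose(0, 1),              # CTC expects (T, B, V)
                      targets, enc_lens, target_lens)
    return cif_ce_loss + lam * ctc
```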

* 5 pages, 4 figures 

A Hybrid CNN-BiLSTM Voice Activity Detector

Mar 05, 2021
Nicholas Wilkinson, Thomas Niesler

This paper presents a new hybrid architecture for voice activity detection (VAD) incorporating both convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) layers trained in an end-to-end manner. In addition, we focus specifically on optimising the computational efficiency of our architecture in order to deliver robust performance in difficult in-the-wild noise conditions in a severely under-resourced setting. Nested k-fold cross-validation was used to explore the hyperparameter space, and the trade-off between optimal parameters and model size is discussed. The performance effect of a BiLSTM layer compared to a unidirectional LSTM layer was also considered. We compare our systems with three established baselines on the AVA-Speech dataset. We find that significantly smaller models with near-optimal parameters perform on par with larger models trained with optimal parameters. BiLSTM layers were shown to improve accuracy over unidirectional layers by approximately 2% absolute on average. With an area under the curve (AUC) of 0.951, our system outperforms all baselines, including a much larger ResNet system, particularly in difficult noise conditions.
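
A compact PyTorch sketch of a hybrid CNN-BiLSTM frame classifier is given below. The filter counts, hidden size, and 40-band log-mel input are assumptions for illustration, not the hyperparameters selected by the paper's nested cross-validation.

```python
import torch
import torch.nn as nn

class CNNBiLSTMVAD(nn.Module):
    """Toy hybrid CNN-BiLSTM voice activity detector (illustrative sizes)."""

    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.bilstm = nn.LSTM(32 * (n_mels // 4), hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, logmel):                           # (B, 1, n_mels, frames)
        x = self.cnn(logmel)                             # (B, 32, n_mels // 4, frames)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, frames, features)
        x, _ = self.bilstm(x)
        return torch.sigmoid(self.head(x)).squeeze(-1)   # per-frame speech probability

model = CNNBiLSTMVAD()
probs = model(torch.randn(2, 1, 40, 100))                # (2, 100) frame-level scores
```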

* ICASSP 2021 

Amplitude Demodulation of Wideband Signals

Feb 09, 2021
Mantas Gabrielaitis

Amplitude demodulation is a classical operation in signal processing. For a long time, its effective application in practice has been limited to narrowband signals. In this work, we generalize amplitude demodulation to wideband signals. We pose demodulation as a recovery problem of an oversampled corrupted signal and introduce special iterative schemes belonging to the family of alternating projection algorithms to solve it. Sensibly chosen structural assumptions on the demodulation outputs allow us to reveal the high inferential accuracy of the method over a rich set of relevant signals. This new approach surpasses current state-of-the-art demodulation techniques applicable to wideband signals in computational efficiency by up to many orders of magnitude, with no sacrifice in quality. Such performance opens the door to applications of amplitude demodulation in new contexts. In particular, the new method makes online and large-scale offline data processing feasible, including the calculation of modulator-carrier pairs in higher dimensions and under poor sampling conditions, independent of the signal bandwidth. We illustrate the utility and specifics of practical applications of the new method using synthetic and natural speech signals.
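
The snippet below is only a toy illustration of the alternating-projection idea applied to envelope estimation, alternating between a band-limiting projection and a lower-bound constraint on the modulator. It is not the algorithm proposed in the paper, and the cutoff frequency and iteration count are arbitrary assumptions.

```python
import numpy as np

def demodulate_ap(signal, sr, cutoff_hz=20.0, n_iter=50):
    """Toy alternating-projection envelope (modulator) estimate.

    Alternates between (1) projecting the modulator onto the set of
    low-pass (band-limited) signals and (2) enforcing that it
    upper-bounds the signal magnitude. Illustration only.
    """
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    keep = freqs <= cutoff_hz                      # low-pass projection mask
    m = np.abs(signal).astype(float)
    for _ in range(n_iter):
        spec = np.fft.rfft(m)
        spec[~keep] = 0.0                          # projection 1: band-limit
        m = np.fft.irfft(spec, n)
        m = np.maximum(m, np.abs(signal))          # projection 2: bound |signal|
    return m

# Example: 4 Hz modulator on a 200 Hz carrier
sr = 8000
t = np.arange(sr) / sr
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)
env = demodulate_ap(x, sr)
```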


Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition

Jan 27, 2021
Pranay Gupta, Divyanshu Sharma, Ravi Kiran Sarvadevabhatla

We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between the action sequence embedding and embeddings of Part-of-Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., to recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state-of-the-art performance in both ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets.
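
Since the inter-modal constraints rely on PoS-tagged words from the action description, a small hedged sketch of that preprocessing step is shown below; it only groups description words by PoS tag and omits the generative embedding models themselves. The example description is hypothetical.

```python
import nltk

# The tagger resource name differs across NLTK versions; downloading both is harmless.
for res in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

description = "person throws a ball"             # hypothetical action description
tags = nltk.pos_tag(description.split())         # [('person', 'NN'), ('throws', 'VBZ'), ...]

nouns = [w for w, t in tags if t.startswith("NN")]
verbs = [w for w, t in tags if t.startswith("VB")]
print(nouns, verbs)                               # ['person', 'ball'] ['throws']
```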

* Code and pretrained models available at https://github.com/skelemoa/synse-zsl 

GREEK-BERT: The Greeks visiting Sesame Street

Sep 03, 2020
John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis, Ion Androutsopoulos

Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for modern Greek. We evaluate its performance in three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining state-of-the-art performance. Interestingly, in two of the benchmarks GREEK-BERT outperforms two multilingual Transformer-based models (M-BERT, XLM-R), as well as shallower neural baselines operating on pre-trained word embeddings, by a large margin (5%-10%). Most importantly, we make both GREEK-BERT and our training code publicly available, along with code illustrating how GREEK-BERT can be fine-tuned for downstream NLP tasks. We expect these resources to boost NLP research and applications for modern Greek.
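
For readers who want to try the released checkpoint, a minimal Hugging Face sketch follows. The model id is the publicly listed GREEK-BERT checkpoint (verify it against the authors' repository), and the three-label head stands in for the natural language inference task; the example sentences are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlpaueb/bert-base-greek-uncased-v1"   # released GREEK-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Encode a Greek premise-hypothesis pair for an NLI-style classification head.
inputs = tokenizer("Η γάτα κοιμάται.", "Ένα ζώο ξεκουράζεται.",
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits   # fine-tune with a standard cross-entropy loss
```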

* 8 pages, 1 figure, 11th Hellenic Conference on Artificial Intelligence (SETN 2020) 

Robust Prediction of Punctuation and Truecasing for Medical ASR

Jul 11, 2020
Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff

Automatic speech recognition (ASR) in the medical domain, which focuses on transcribing clinical dictations and doctor-patient conversations, poses many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or "exclamation point", while truecasing enhances readability and improves the performance of downstream NLP tasks. This paper proposes a conditional joint modeling framework for the prediction of punctuation and truecasing using pretrained masked language models such as BERT, BioBERT and RoBERTa. We also present techniques for domain- and task-specific adaptation by fine-tuning masked language models with medical-domain data. Finally, we improve the robustness of the model against common ASR errors by performing data augmentation. Experiments on dictation and conversational style corpora show that our proposed model achieves ~5% absolute improvement on ground-truth text and ~10% improvement on ASR outputs over baseline models in terms of the F1 metric.
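
As a rough sketch of the joint tagging setup, the model below places two token-level classification heads (punctuation and casing) on top of a pretrained masked language model. The label inventories, the encoder name, and the omission of the paper's conditional factorisation are all simplifying assumptions.

```python
import torch.nn as nn
from transformers import AutoModel

class PunctCaseTagger(nn.Module):
    """Illustrative joint punctuation + truecasing tagger (simplified)."""

    PUNCT = ["O", "PERIOD", "COMMA", "QUESTION"]       # assumed label sets
    CASE = ["LOWER", "UPPER_INIT", "ALL_CAPS"]

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.punct_head = nn.Linear(hidden, len(self.PUNCT))
        self.case_head = nn.Linear(hidden, len(self.CASE))

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return self.punct_head(h), self.case_head(h)   # per-token logits
```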

* Accepted for ACL NLPMC workshop 2020 

Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

May 15, 2020
Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed relative to the actual acoustic boundaries, since their unidirectional encoders lack future information. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several training strategies that leverage external hard alignments extracted from the hybrid model. We investigate utilizing the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and (4) directly minimize the differentiable expected latency loss. Experiments on the Cortana voice search task demonstrate that our proposed methods significantly reduce the latency and, on the decoder side, even improve recognition accuracy in certain cases. We also present some analysis to understand the behaviors of streaming S2S models.
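
One way to read strategy (4) is as a penalty on the expected emission delay of each token. The sketch below is a hedged guess at such a term, taking a soft emission distribution over encoder frames and the hard-alignment boundary frames as inputs; it is not the paper's exact formulation.

```python
import torch

def expected_latency_loss(emit_probs, ref_frames):
    """Differentiable expected-latency penalty (illustrative only).

    emit_probs: (batch, tokens, frames) soft distribution over the frame at
                which each token is emitted (e.g., normalized attention weights).
    ref_frames: (batch, tokens) boundary frames from the external hard alignments.
    Only emissions later than the reference boundary are penalized.
    """
    frames = torch.arange(emit_probs.size(-1), device=emit_probs.device)
    delay = (frames.view(1, 1, -1) - ref_frames.unsqueeze(-1)).clamp(min=0).float()
    return (emit_probs * delay).sum(dim=-1).mean()
```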

* Accepted at IEEE ICASSP 2020 
