Title: Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Mar 13, 2018

Matthijs Van keirsbilck, Bert Moons, Marian Verhelst

Abstract: Today's Automatic Speech Recognition (ASR) systems rely only on acoustic signals and often perform poorly under noisy conditions. Multi-modal speech recognition, which processes acoustic speech signals and lip-reading video simultaneously, significantly improves the performance of such systems, especially in noisy environments. This work presents the design of such an audio-visual ASR system, taking memory and computation requirements into account. First, a Long Short-Term Memory (LSTM) neural network for acoustic speech recognition is designed. Second, Convolutional Neural Networks (CNNs) are used to model lip-reading features. These are combined with an LSTM network to model temporal dependencies and perform automatic lip-reading on video. Finally, the acoustic-speech and visual lip-reading networks are combined to process acoustic and visual features simultaneously. An attention mechanism helps maintain the model's performance in noisy environments. The system is evaluated on the TCD-TIMIT 'lipspeaker' dataset for audio-visual phoneme recognition with clean audio and with additive white noise at an SNR of 0 dB. It achieves 75.70% and 58.55% phoneme accuracy respectively, over 14 percentage points better than the state of the art for all noise levels.
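
The noisy evaluation condition mentioned in the abstract (additive white noise at an SNR of 0 dB) can be illustrated with a minimal sketch. This is not the authors' code: the function names `add_white_noise` and `measured_snr_db`, the fixed seed, and the 440 Hz test tone are all assumptions made for illustration. It relies only on the standard definition SNR_dB = 10 log10(P_signal / P_noise), under which 0 dB means the noise carries the same average power as the signal.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return (noisy_signal, noise) with Gaussian white noise at the given SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB/10)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    raw_power = sum(n * n for n in raw) / len(raw)
    # Rescale the drawn noise so its empirical power matches the target.
    scale = math.sqrt(target_noise_power / raw_power)
    noise = [n * scale for n in raw]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal, noise):
    """Empirical SNR in dB from the average powers of signal and noise."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


# Assumed toy input: one second of a 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy, noise = add_white_noise(tone, snr_db=0.0)
```

Calling `measured_snr_db(tone, noise)` afterwards should give a value very close to 0 dB, since the noise is rescaled to match the signal power exactly rather than only in expectation.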

* Tech. report
