Sourav Bhattacharya

Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition

Jul 12, 2023
Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Modern speech recognition systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but fail to consistently reach the same level of accuracy. In practice, however, the self-attention weights of trained speech recognizers take the form of a global average over time. This paper, therefore, proposes a linear-time alternative to self-attention for speech recognition. It summarises a whole utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "Summary Mixing". Introducing Summary Mixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 27% and reducing the memory budget by a factor of two.
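
Below is a minimal sketch of the Summary Mixing idea as described in the abstract: a per-frame (time-specific) branch is combined with a single mean-over-time summary, giving linear rather than quadratic cost in the utterance length. Module and layer names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of Summary Mixing (hypothetical module/parameter names).
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    """Replaces self-attention: a per-utterance mean summary is combined
    with a per-frame (time-specific) transformation, giving O(T) cost."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())    # time-specific branch
        self.summary = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())  # summary branch
        self.combine = nn.Linear(2 * hidden, dim)                        # merge both views

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        local = self.local(x)                                 # per-frame features
        summary = self.summary(x).mean(dim=1, keepdim=True)   # mean over all time steps
        summary = summary.expand(-1, x.size(1), -1)           # broadcast the summary to every frame
        return self.combine(torch.cat([local, summary], dim=-1))

x = torch.randn(2, 100, 256)        # (batch, frames, features)
print(SummaryMixing(256)(x).shape)  # torch.Size([2, 100, 256])
```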

* Submitted to NeurIPS 2023 

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

Nov 08, 2022
Shucong Zhang, Malcolm Chadwick, Alberto Gil C. P. Ramos, Sourav Bhattacharya

Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach outperforms competitive baselines consistently, even when our model is only approximately half the size.
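
The following sketch illustrates the kind of cross-attention conditioning described above: each frame of the noisy mixture attends over the enrolment audio, so the target-speaker representation adapts per frame instead of being a single static embedding. All names, shapes and the residual combination are illustrative assumptions rather than the paper's exact architecture.

```python
# A rough sketch of conditioning an enhancement model on the target speaker
# via cross-attention rather than a fixed embedding (illustrative only).
import torch
import torch.nn as nn

class SpeakerCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, mix_frames: torch.Tensor, enrol_frames: torch.Tensor) -> torch.Tensor:
        # mix_frames:   (batch, T_mix, dim)  frames of the noisy mixture (queries)
        # enrol_frames: (batch, T_enr, dim)  frames of the enrolment clip (keys/values)
        # Each mixture frame attends to the enrolment audio, so the target-speaker
        # representation adapts to the current acoustic conditions.
        adapted, _ = self.attn(mix_frames, enrol_frames, enrol_frames)
        return mix_frames + adapted

mix = torch.randn(1, 200, 128)
enrol = torch.randn(1, 300, 128)
print(SpeakerCrossAttention(128)(mix, enrol).shape)  # torch.Size([1, 200, 128])
```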

Defensive Tensorization

Oct 26, 2021
Adrian Bulat, Jean Kossaifi, Sourav Bhattacharya, Yannis Panagakis, Timothy Hospedales, Georgios Tzimiropoulos, Nicholas D Lane, Maja Pantic

We propose defensive tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network. The layers of a network are first expressed as factorized tensor layers. Tensor dropout is then applied in the latent subspace, resulting in dense reconstructed weights, without the sparsity or perturbations typically induced by the randomization. Our approach can be readily integrated with any neural architecture and combined with techniques like adversarial training. We empirically demonstrate the effectiveness of our approach on standard image classification benchmarks. We validate the versatility of our approach across domains and low-precision architectures by considering an audio classification task and binary networks. In all cases, we demonstrate improved performance compared to prior works.
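
As a toy illustration of the mechanism sketched in the abstract, the example below keeps a layer's weight in a latent low-rank factorization, applies dropout to the latent factors, and reconstructs a dense weight for the forward pass. The rank, dropout rate and the simple two-factor parameterisation are assumptions made for brevity, not the paper's tensor layers.

```python
# Dropout applied in a latent factorized space, then a dense weight is rebuilt.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedLinear(nn.Module):
    def __init__(self, in_f: int, out_f: int, rank: int = 16, p: float = 0.2):
        super().__init__()
        self.u = nn.Parameter(torch.randn(out_f, rank) * 0.05)  # latent factors
        self.v = nn.Parameter(torch.randn(rank, in_f) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = F.dropout(self.u, self.p, self.training)  # randomization in the latent subspace
        v = F.dropout(self.v, self.p, self.training)
        w = u @ v                                     # reconstructed weight is dense
        return F.linear(x, w, self.bias)

layer = FactorizedLinear(64, 32)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 32])
```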

* To be presented at BMVC 2021 

Bunched LPCNet: Vocoder for Low-cost Neural Text-To-Speech Systems

Aug 11, 2020
Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C. P. Ramos, Nicholas D. Lane

LPCNet is an efficient vocoder that combines linear prediction and deep neural network modules to keep the computational complexity low. In this work, we present two techniques to further reduce its complexity, aiming for a low-cost LPCNet vocoder-based neural Text-to-Speech (TTS) system. These techniques are: 1) Sample-bunching, which allows LPCNet to generate more than one audio sample per inference; and 2) Bit-bunching, which reduces the computations in the final layer of LPCNet. With the proposed bunching techniques, LPCNet, in conjunction with a Deep Convolutional TTS (DCTTS) acoustic model, shows a 2.19x improvement over the baseline run-time when running on a mobile device, with a less than 0.1 decrease in TTS mean opinion score (MOS).
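
A schematic sketch of the sample-bunching idea follows: the final stage of the network emits S > 1 audio samples per forward pass, so the recurrent core runs 1/S as often. The bunch size, layer widths and the 8-bit output assumption are illustrative, not the paper's configuration.

```python
# Schematic "sample bunching" head: S samples per network inference (illustrative).
import torch
import torch.nn as nn

class BunchedSampleHead(nn.Module):
    def __init__(self, hidden: int = 128, bunch: int = 2, levels: int = 256):
        super().__init__()
        self.bunch = bunch
        self.levels = levels
        # One output distribution per bunched sample (here 8-bit mu-law levels).
        self.proj = nn.Linear(hidden, bunch * levels)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) state from one recurrent step
        logits = self.proj(h).view(-1, self.bunch, self.levels)
        return logits.softmax(dim=-1)  # (batch, bunch, levels): S samples per inference

head = BunchedSampleHead()
print(head(torch.randn(4, 128)).shape)  # torch.Size([4, 2, 256])
```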

* Interspeech 2020 

Iterative Compression of End-to-End ASR Model using AutoML

Aug 06, 2020
Abhinav Mehrotra, Łukasz Dudziak, Jinsu Yeo, Young-yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C. P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane

Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interest in developing automatic model compression techniques. Past research has shown that an AutoML-based Low Rank Factorization (LRF) technique, when applied to an end-to-end Encoder-Attention-Decoder style ASR model, can achieve a speedup of up to 3.7x, outperforming laborious manual rank-selection approaches. However, we show that current AutoML-based search techniques only work up to a certain compression level, beyond which they fail to produce compressed models with acceptable word error rates (WER). In this work, we propose an iterative AutoML-based LRF approach that achieves over 5x compression without degrading the WER, thereby advancing the state-of-the-art in ASR compression.
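
The snippet below sketches an iterative low-rank factorization (LRF) loop in its simplest form: a weight matrix is compressed a little at a time via truncated SVD, with room to fine-tune and check WER between rounds. The rank schedule and the fixed number of rounds stand in for the paper's AutoML-driven search and are purely illustrative.

```python
# Iterative truncated-SVD compression of a weight matrix (illustrative only).
import numpy as np

def lrf(weight: np.ndarray, rank: int):
    """Return the two low-rank factors of a truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def iterative_compress(weight: np.ndarray, shrink: float = 0.8, rounds: int = 3):
    rank = min(weight.shape)
    for _ in range(rounds):
        rank = max(1, int(rank * shrink))  # shrink the rank gradually
        a, b = lrf(weight, rank)
        weight = a @ b                     # in practice: fine-tune and check WER here
    return a, b

a, b = iterative_compress(np.random.randn(512, 256))
print(a.shape, b.shape)  # factor shapes after three rounds, e.g. (512, 130) (130, 256)
```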

* INTERSPEECH 2020  

MobiSR: Efficient On-Device Super-Resolution through Heterogeneous Mobile Processors

Aug 21, 2019
Royson Lee, Stylianos I. Venieris, Łukasz Dudziak, Sourav Bhattacharya, Nicholas D. Lane

In recent years, convolutional networks have demonstrated unprecedented performance in the image restoration task of super-resolution (SR). SR entails the upscaling of a single low-resolution image in order to meet application-specific image quality demands and plays a key role in mobile devices. To comply with privacy regulations and reduce the overhead of cloud computing, executing SR models locally on-device constitutes a key alternative approach. Nevertheless, the excessive compute and memory requirements of SR workloads pose a challenge in mapping SR networks on resource-constrained mobile platforms. This work presents MobiSR, a novel framework for performing efficient super-resolution on-device. Given a target mobile platform, the proposed framework considers popular model compression techniques and traverses the design space to reach the highest performing trade-off between image quality and processing speed. At run time, a novel scheduler dispatches incoming image patches to the appropriate model-engine pair based on the patch's estimated upscaling difficulty in order to meet the required image quality with minimum processing latency. Quantitative evaluation shows that the proposed framework yields on-device SR designs that achieve an average speedup of 2.13x over highly-optimized parallel difficulty-unaware mappings and 4.79x over highly-optimized single compute engine implementations.
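
A toy version of the difficulty-aware dispatching described above is sketched below: each incoming patch gets a cheap difficulty estimate and is routed either to a compact model-engine pair or a more capable one. The total-variation proxy, the threshold and the two-engine split are assumptions for illustration, not MobiSR's actual scheduler.

```python
# Difficulty-aware routing of image patches to two model/engine options (toy sketch).
import numpy as np

def difficulty(patch: np.ndarray) -> float:
    # Total variation as a rough proxy for how hard the patch is to upscale.
    return float(np.abs(np.diff(patch, axis=0)).mean() +
                 np.abs(np.diff(patch, axis=1)).mean())

def dispatch(patches, run_fast, run_accurate, threshold: float = 0.08):
    outputs = []
    for p in patches:
        engine = run_accurate if difficulty(p) > threshold else run_fast
        outputs.append(engine(p))
    return outputs

patches = [np.random.rand(32, 32) for _ in range(4)]
upscale = lambda p: np.kron(p, np.ones((2, 2)))  # stand-in for a real SR model
print(len(dispatch(patches, upscale, upscale)))  # 4
```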

* Accepted at the 25th Annual International Conference on Mobile Computing and Networking (MobiCom), 2019 

Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data

Nov 29, 2017
Petar Veličković, Laurynas Karazija, Nicholas D. Lane, Sourav Bhattacharya, Edgar Liberis, Pietro Liò, Angela Chieh, Otmane Bellahsen, Matthieu Vegreville

We analyse multimodal time-series data corresponding to weight, sleep and steps measurements. We focus on predicting whether a user will successfully achieve his/her weight objective. For this, we design several deep long short-term memory (LSTM) architectures, including a novel cross-modal LSTM (X-LSTM), and demonstrate their superiority over baseline approaches. The X-LSTM improves parameter efficiency by processing each modality separately and allowing for information flow between them by way of recurrent cross-connections. We present a general hyperparameter optimisation technique for X-LSTMs, which allows us to significantly improve on the LSTM and a prior state-of-the-art cross-modal approach, using a comparable number of parameters. Finally, we visualise the model's predictions, revealing implications about latent variables in this task.
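
The sketch below captures the cross-modal idea in miniature: one LSTM per modality, with each stream's summary fed into the other streams through cross-connections before a second recurrent layer and a shared prediction head. Layer sizes and the exact fusion wiring are illustrative assumptions, not the paper's X-LSTM.

```python
# Per-modality LSTMs with cross-connections between streams (illustrative sketch).
import torch
import torch.nn as nn

class CrossModalLSTM(nn.Module):
    def __init__(self, dims=(1, 1, 1), hidden: int = 16):
        super().__init__()
        self.first = nn.ModuleList(nn.LSTM(d, hidden, batch_first=True) for d in dims)
        self.cross = nn.ModuleList(nn.Linear(hidden * (len(dims) - 1), hidden) for _ in dims)
        self.second = nn.ModuleList(nn.LSTM(2 * hidden, hidden, batch_first=True) for _ in dims)
        self.out = nn.Linear(hidden * len(dims), 1)

    def forward(self, xs):
        hs = [lstm(x)[0] for lstm, x in zip(self.first, xs)]  # per-modality sequences
        finals = [h[:, -1] for h in hs]                        # per-modality summaries
        outs = []
        for i, (h, cross, lstm) in enumerate(zip(hs, self.cross, self.second)):
            others = torch.cat([f for j, f in enumerate(finals) if j != i], dim=-1)
            ctx = cross(others).unsqueeze(1).expand(-1, h.size(1), -1)  # cross-connection
            outs.append(lstm(torch.cat([h, ctx], dim=-1))[0][:, -1])
        return torch.sigmoid(self.out(torch.cat(outs, dim=-1)))  # P(objective achieved)

xs = [torch.randn(8, 30, 1) for _ in range(3)]  # e.g. weight, sleep, steps series
print(CrossModalLSTM()(xs).shape)               # torch.Size([8, 1])
```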

* To appear in NIPS ML4H 2017 and NIPS TSW 2017 

Towards Using Unlabeled Data in a Sparse-coding Framework for Human Activity Recognition

Jul 23, 2014
Sourav Bhattacharya, Petteri Nurmi, Nils Hammerla, Thomas Plötz

We propose a sparse-coding framework for activity recognition in ubiquitous and mobile computing that alleviates two fundamental problems of current supervised learning approaches. (i) It automatically derives a compact, sparse and meaningful feature representation of sensor data that does not rely on prior expert knowledge and generalizes extremely well across domain boundaries. (ii) It exploits unlabeled sample data for bootstrapping effective activity recognizers, i.e., it substantially reduces the amount of ground-truth annotation required for model estimation. Such unlabeled data is trivial to obtain, e.g., through contemporary smartphones carried by users as they go about their everyday activities. Based on the self-taught learning paradigm, we automatically derive an over-complete set of basis vectors from unlabeled data that captures inherent patterns present within activity data. Effective feature extraction is then pursued by projecting raw sensor data onto the feature space defined by these over-complete sets of basis vectors. Given these learned feature representations, classification backends are trained using small amounts of labeled training data. We study the new approach in detail using two datasets which differ in terms of the recognition tasks and sensor modalities. Primarily, we focus on the transportation mode analysis task, a popular task in mobile-phone-based sensing. The sparse-coding framework significantly outperforms state-of-the-art supervised learning approaches. Furthermore, we demonstrate the great practical potential of the new approach by successfully evaluating its generalization capabilities across both domains and sensor modalities by considering the popular Opportunity dataset. Our feature learning approach outperforms state-of-the-art approaches to analyzing activities of daily living.
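
A minimal sketch of the self-taught pipeline described above: learn an over-complete dictionary from unlabeled sensor windows, encode labeled windows as sparse codes over that dictionary, and train a light classification backend on the codes. The window length, dictionary size, sparse coder and classifier are illustrative assumptions, not the paper's exact setup.

```python
# Self-taught sparse coding for activity recognition (illustrative pipeline).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
unlabeled = rng.standard_normal((2000, 64))  # unlabeled sensor windows
labeled = rng.standard_normal((200, 64))     # small labeled set
labels = rng.integers(0, 4, size=200)        # e.g. four transportation modes

# 1) Learn an over-complete dictionary (more atoms than input dimensions).
dico = MiniBatchDictionaryLearning(n_components=128, transform_algorithm="omp",
                                   transform_n_nonzero_coefs=8, random_state=0)
dico.fit(unlabeled)

# 2) Project labeled windows onto the learned basis to get sparse features.
codes = dico.transform(labeled)

# 3) Train the classification backend on the sparse codes.
clf = LogisticRegression(max_iter=1000).fit(codes, labels)
print(codes.shape, clf.score(codes, labels))
```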

* 18 pages, 12 figures, Pervasive and Mobile Computing, 2014 