Saikat Chatterjee

Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

Aug 12, 2025

Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee

Abstract:Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing different levels of representation. While prior studies explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.

* Accepted at IEEE ASRU 2025

Leveraging LLMs for Scalable Non-intrusive Speech Quality Assessment

Aug 08, 2025

Fredrik Cumlin, Xinyu Liang, Anubhab Ghosh, Saikat Chatterjee

Abstract:Non-intrusive speech quality assessment (SQA) systems suffer from limited training data and costly human annotations, hindering their generalization to real-time conferencing calls. In this work, we propose leveraging large language models (LLMs) as pseudo-raters for speech quality to address these data bottlenecks. We construct LibriAugmented, a dataset consisting of 101,129 speech clips with simulated degradations labeled by a fine-tuned auditory LLM (Vicuna-7b-v1.5). We compare three training strategies: using human-labeled data, using LLM-labeled data, and a two-stage approach (pretraining on LLM labels, then fine-tuning on human labels), using both DNSMOS Pro and DeePMOS. We test on several datasets across languages and quality degradations. While LLM-labeled training yields mixed results compared to human-labeled training, we provide empirical evidence that the two-stage approach improves the generalization performance (e.g., DNSMOS Pro achieves 0.63 vs. 0.55 PCC on NISQA_TEST_LIVETALK and 0.73 vs. 0.65 PCC on Tencent with reverb). Our findings demonstrate the potential of using LLMs as scalable pseudo-raters for speech quality assessment, offering a cost-effective solution to the data limitation problem.

* ECAI workshop paper

Multivariate Probabilistic Assessment of Speech Quality

Jun 05, 2025

Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee

Abstract:The mean opinion score (MOS) is a standard metric for assessing speech quality, but its singular focus fails to identify specific distortions when low scores are observed. The NISQA dataset addresses this limitation by providing ratings across four additional dimensions: noisiness, coloration, discontinuity, and loudness, alongside MOS. In this paper, we extend the explored univariate MOS estimation to a multivariate framework by modeling these dimensions jointly using a multivariate Gaussian distribution. Our approach utilizes Cholesky decomposition to predict covariances without imposing restrictive assumptions and extends probabilistic affine transformations to a multivariate context. Experimental results show that our model performs on par with state-of-the-art methods in point estimation, while uniquely providing uncertainty and correlation estimates across speech quality dimensions. This enables better diagnosis of poor speech quality and informs targeted improvements.

* Accepted at Interspeech 2025
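
The Cholesky-based covariance prediction mentioned in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `build_cholesky` and `gaussian_nll`, the softplus on the diagonal, and the random stand-in for a network output are all assumptions made for illustration. The dimension 5 corresponds to MOS plus the four NISQA dimensions.

```python
import numpy as np

def build_cholesky(raw, dim):
    """Map an unconstrained vector of length dim*(dim+1)/2 to a lower-triangular
    matrix with a strictly positive diagonal (hypothetical helper)."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = raw
    diag = np.diag_indices(dim)
    # Softplus keeps the diagonal positive, so L @ L.T is positive definite.
    L[diag] = np.log1p(np.exp(L[diag])) + 1e-6
    return L

def gaussian_nll(y, mean, L):
    """Negative log-likelihood of y under N(mean, L @ L.T)."""
    dim = y.shape[0]
    z = np.linalg.solve(L, y - mean)            # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log det of L @ L.T
    return 0.5 * (z @ z + log_det + dim * np.log(2.0 * np.pi))

# Stand-in for a network output: one mean vector and one raw Cholesky vector.
dim = 5  # MOS, noisiness, coloration, discontinuity, loudness
rng = np.random.default_rng(0)
raw = rng.normal(size=dim * (dim + 1) // 2)
mean = rng.normal(size=dim)
y = rng.normal(size=dim)

L = build_cholesky(raw, dim)
cov = L @ L.T          # full covariance, valid by construction
nll = gaussian_nll(y, mean, L)
```

Because the covariance is formed as `L @ L.T`, no diagonal or low-rank restriction is needed, and the off-diagonal entries of `cov` carry the correlations between quality dimensions.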

Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models

Apr 30, 2025

Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee

Abstract:In this article, we provide an experimental observation: Deep neural network (DNN) based speech quality assessment (SQA) models have inherent latent representations where many types of impairments are clustered. While DNN-based SQA models are not trained for impairment classification, our experiments show good impairment classification results in an appropriate SQA latent representation. We investigate the clustering of impairments using various kinds of audio degradations that include different types of noises, waveform clipping, gain transition, pitch shift, compression, reverberation, etc. To visualize the clusters, we perform classification of impairments in the SQA-latent representation domain using a standard k-nearest neighbor (kNN) classifier. We also develop a new DNN-based SQA model, named DNSMOS+, to examine whether an improvement in SQA leads to an improvement in impairment classification. The classification accuracy is 94% for the LibriAugmented dataset with 16 types of impairments and 54% for the ESC-50 dataset with 50 types of real noises.
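
The kNN probing idea described above can be sketched as follows. This is only an illustration on synthetic data: the clustered vectors stand in for latents taken from an SQA model, the function `knn_predict` and all sizes are hypothetical, and the resulting accuracy says nothing about the figures reported in the abstract.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=5):
    """Classify each query vector by majority vote among its k nearest
    training vectors (squared Euclidean distance)."""
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest points
    votes = train_y[nearest]                  # their impairment labels
    return np.array([np.bincount(v).argmax() for v in votes])

# Synthetic stand-in for SQA latents: one Gaussian cluster per impairment type.
rng = np.random.default_rng(0)
n_classes, per_class, latent_dim = 4, 50, 16
centers = rng.normal(scale=5.0, size=(n_classes, latent_dim))
labels = np.repeat(np.arange(n_classes), per_class)
latents = centers[labels] + rng.normal(size=(labels.size, latent_dim))

# Hold out a quarter of the points and measure kNN accuracy on them.
perm = rng.permutation(labels.size)
train, test = perm[:150], perm[150:]
pred = knn_predict(latents[train], labels[train], latents[test], k=5)
accuracy = (pred == labels[test]).mean()
```

High kNN accuracy on held-out points is one simple indicator that the classes form separable clusters in the chosen representation.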

AI-Aided Kalman Filters

Oct 16, 2024

Nir Shlezinger, Guy Revach, Anubhab Ghosh, Saikat Chatterjee, Shuo Tang, Tales Imbiriba, Jindrich Dunik, Ondrej Straka, Pau Closas, Yonina C. Eldar

Abstract:The Kalman filter (KF) and its variants are among the most celebrated algorithms in signal processing. These methods are used for state estimation of dynamic systems by relying on mathematical representations in the form of simple state-space (SS) models, which may be crude and inaccurate descriptions of the underlying dynamics. Emerging data-centric artificial intelligence (AI) techniques tackle these tasks using deep neural networks (DNNs), which are model-agnostic. Recent developments illustrate the possibility of fusing DNNs with classic Kalman-type filtering, obtaining systems that learn to track in partially known dynamics. This article provides a tutorial-style overview of design approaches for incorporating AI in aiding KF-type algorithms. We review both generic and dedicated DNN architectures suitable for state estimation, and provide a systematic presentation of techniques for fusing AI tools with KFs and for leveraging partial SS modeling and data, categorizing design approaches into task-oriented and SS model-oriented. The usefulness of each approach in preserving the individual strengths of model-based KFs and data-driven DNNs is investigated in a qualitative and quantitative study, whose code is publicly available, illustrating the gains of hybrid model-based/data-driven designs. We also discuss existing challenges and future research directions that arise from fusing AI and Kalman-type algorithms.

* Submitted to IEEE Signal Processing Magazine

Near-Field ISAC in 6G: Addressing Phase Nonlinearity via Lifted Super-Resolution

Oct 07, 2024

Sajad Daei, Amirreza Zamani, Saikat Chatterjee, Mikael Skoglund, Gabor Fodor

Abstract:Integrated sensing and communications (ISAC) is a promising component of 6G networks, fusing communication and radar technologies to facilitate new services. Additionally, the use of extremely large-scale antenna arrays (ELLA) at the ISAC common receiver not only facilitates terahertz-rate communication links but also significantly enhances the accuracy of target detection in radar applications. In practical scenarios, communication scatterers and radar targets often reside in close proximity to the ISAC receiver. This, combined with the use of ELLA, fundamentally alters the electromagnetic characteristics of wireless and radar channels, shifting from far-field planar-wave propagation to near-field spherical wave propagation. Under the far-field planar-wave model, the phase of the array response vector varies linearly with the antenna index. In contrast, in the near-field spherical wave model, this phase relationship becomes nonlinear. This shift presents a fundamental challenge: the widely-used Fourier analysis can no longer be directly applied for target detection and communication channel estimation at the ISAC common receiver. In this work, we propose a feasible solution to address this fundamental issue. Specifically, we demonstrate that there exists a high-dimensional space in which the phase nonlinearity can be expressed as linear. Leveraging this insight, we develop a lifted super-resolution framework that simultaneously performs communication channel estimation and extracts target parameters with high precision.
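
The contrast between linear and nonlinear phase described above can be checked numerically. The sketch below assumes a uniform linear array, a single point source, and one particular sign convention; the function names and the values for wavelength, spacing, range and angle are illustrative choices, not taken from the paper.

```python
import numpy as np

def far_field_phase(n, d, theta, lam):
    """Planar-wave model: phase is linear in the antenna index n."""
    return (2.0 * np.pi / lam) * n * d * np.sin(theta)

def near_field_phase(n, d, r, theta, lam):
    """Spherical-wave model: phase follows the exact distance from a source at
    range r and angle theta to element n, relative to the reference element."""
    r_n = np.sqrt(r**2 + (n * d) ** 2 - 2.0 * r * n * d * np.sin(theta))
    return -(2.0 * np.pi / lam) * (r_n - r)

lam = 0.01                 # wavelength in metres (assumed, about 30 GHz)
d = lam / 2                # half-wavelength element spacing
n = np.arange(256)         # antenna indices of a large array
theta = np.deg2rad(30.0)   # source angle
r = 5.0                    # source range in metres, close to the array

ff = far_field_phase(n, d, theta, lam)
nf = near_field_phase(n, d, r, theta, lam)

# Second differences vanish for a linear phase and do not for a curved one.
ff_curvature = np.abs(np.diff(ff, 2)).max()
nf_curvature = np.abs(np.diff(nf, 2)).max()
```

With these values `ff_curvature` is zero up to rounding while `nf_curvature` is not, which is why a plain Fourier (linear-phase) dictionary no longer matches the near-field array response.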

Data-driven Bayesian State Estimation with Compressed Measurement of Model-free Process using Semi-supervised Learning

Jul 10, 2024

Anubhab Ghosh, Yonina C. Eldar, Saikat Chatterjee

Abstract:The research topic is data-driven Bayesian state estimation with compressed measurement (BSCM) of a model-free process, say for a (causal) tracking application. The dimension of the temporal measurement vector is lower than the dimension of the temporal state vector to be estimated. Hence the state estimation problem is an underdetermined inverse problem. The state-space model (SSM) of the underlying dynamical process is assumed to be unknown, and hence we use the terminology 'model-free process'. In the absence of the SSM, we cannot employ traditional model-driven methods like the Kalman filter (KF) and particle filter (PF) and instead require data-driven methods. We first experimentally show that two existing unsupervised learning-based data-driven methods fail to address the BSCM problem for a model-free process: the data-driven nonlinear state estimation (DANSE) method and the deep Markov model (DMM) method. The unsupervised learning uses unlabelled data comprising only noisy measurements. While DANSE provides good predictive performance in modeling the temporal measurement data as a time series, its unsupervised learning lacks a regularization for state estimation. We then investigate the use of a semi-supervised learning approach and develop a semi-supervised learning-based DANSE method, referred to as SemiDANSE. In semi-supervised learning, we use a limited amount of labelled data along with a large amount of unlabelled data, which helps to bring the desired regularization for the BSCM problem in the absence of the SSM. The labelled data consist of pairwise measurement-and-state data. Using three chaotic dynamical systems (or processes) with nonlinear SSMs as benchmarks, we show that the data-driven SemiDANSE provides competitive performance for BSCM against three SSM-informed methods: a hybrid method called KalmanNet, and two traditional model-driven methods, the extended KF and the unscented KF.

* 12 pages, under review at IEEE TSP. The abstract on the arXiv webpage is slightly abridged to respect the character limit; please check the PDF version for the unabridged version

Compressed Sensing of Generative Sparse-latent (GSL) Signals

Oct 16, 2023

Antoine Honoré, Anubhab Ghosh, Saikat Chatterjee

Abstract:We consider reconstruction of an ambient signal in a compressed sensing (CS) setup where the ambient signal has a neural network based generative model. The generative model has a sparse-latent input and we refer to the generated ambient signal as generative sparse-latent signal (GSL). The proposed sparsity inducing reconstruction algorithm is inherently non-convex, and we show that a gradient based search provides a good reconstruction performance. We evaluate our proposed algorithm using simulated data.

* Accepted at 31st European Signal Processing Conference, EUSIPCO 2023

DANSE: Data-driven Non-linear State Estimation of Model-free Process in Unsupervised Learning Setup

Jun 04, 2023

Anubhab Ghosh, Antoine Honoré, Saikat Chatterjee

Abstract:We address the tasks of Bayesian state estimation and forecasting for a model-free process in an unsupervised learning setup. In the article, we propose DANSE -- a Data-driven Nonlinear State Estimation method. DANSE provides a closed-form posterior of the state of the model-free process, given linear measurements of the state. In addition, it provides a closed-form posterior for forecasting. A data-driven recurrent neural network (RNN) is used in DANSE to provide the parameters of a prior of the state. The prior depends on the past measurements as input, and then we find the closed-form posterior of the state using the current measurement as input. The data-driven RNN captures the underlying non-linear dynamics of the model-free process. The training of DANSE, mainly learning the parameters of the RNN, is executed using an unsupervised learning approach. In unsupervised learning, we have access to a training dataset comprising only a set of measurement data trajectories, but we do not have any access to the state trajectories. Therefore, DANSE does not have access to state information in the training data and cannot use supervised learning. Using simulated linear and non-linear process models (Lorenz attractor and Chen attractor), we evaluate the unsupervised learning-based DANSE. We show that the proposed DANSE, without knowledge of the process model and without supervised learning, provides competitive performance against model-driven methods, such as the Kalman filter (KF), extended KF (EKF), unscented KF (UKF), and a recently proposed hybrid method called KalmanNet.

* 12 pages; the paper is under review
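
The closed-form posterior step mentioned in the abstract above can be illustrated with standard Gaussian conditioning. The sketch assumes a Gaussian prior on the state and additive Gaussian measurement noise; the function name `gaussian_posterior` and all numbers are illustrative, and in DANSE the prior parameters would come from an RNN fed with past measurements rather than being fixed by hand.

```python
import numpy as np

def gaussian_posterior(prior_mean, prior_cov, H, R, y):
    """Posterior of x given y = H x + w, with x ~ N(prior_mean, prior_cov)
    and w ~ N(0, R). Returns the posterior mean and covariance."""
    S = H @ prior_cov @ H.T + R                  # covariance of the measurement
    K = np.linalg.solve(S, H @ prior_cov).T      # gain: prior_cov H^T S^{-1}
    post_mean = prior_mean + K @ (y - H @ prior_mean)
    post_cov = prior_cov - K @ H @ prior_cov
    return post_mean, post_cov

# Illustrative prior (in DANSE these parameters would be produced by the RNN).
prior_mean = np.array([1.0, 0.0, -1.0])
prior_cov = np.diag([1.0, 2.0, 0.5])

# Linear measurement model with fewer measurements than state dimensions.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
R = 0.1 * np.eye(2)
y = np.array([1.2, -0.5])

post_mean, post_cov = gaussian_posterior(prior_mean, prior_cov, H, R, y)
```

The update needs only the prior parameters, the measurement matrix and the noise covariance, which is what makes the posterior available in closed form at every time step.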

Automated Sentiment and Hate Speech Analysis of Facebook Data by Employing Multilingual Transformer Models

Jan 31, 2023

Ritumbra Manuvie, Saikat Chatterjee

Abstract:In recent years, there has been a heightened consensus within academia and in the public discourse that Social Media Platforms (SMPs) amplify the spread of hateful and negative sentiment content. Researchers have identified how hateful content, political propaganda, and targeted messaging contributed to real-world harms including insurrections against democratically elected governments, genocide, and breakdown of social cohesion due to heightened negative discourse towards certain communities in parts of the world. To counter these issues, SMPs have created semi-automated systems that can help identify toxic speech. In this paper we analyse the statistical distribution of hateful and negative sentiment contents within a representative Facebook dataset (n = 604,703) scraped through 648 public Facebook pages which identify themselves as proponents (and followers) of far-right Hindutva actors. These pages were identified manually using keyword searches on Facebook and on CrowdTangle and classified as far-right Hindutva pages based on page names, page descriptions, and discourses shared on these pages. We employ state-of-the-art, open-source XLM-T multilingual transformer-based language models to perform sentiment and hate speech analysis of the textual contents shared on these pages over a period of 5.5 years. The result shows the statistical distributions of the predicted sentiment and the hate speech labels, top actors, and top page categories. We further discuss the benchmark performances and limitations of these pre-trained language models.
