Topic: Vowel Classification
What is Vowel Classification? Vowel classification is the task of categorizing vowel sounds in speech signals.
Papers and Code
Jan 15, 2025
Abstract:Many voice disorders induce subharmonic phonation, but voice signal analysis currently lacks a technique to reliably detect the presence of subharmonics. Distinguishing subharmonic phonation from normal phonation is challenging, as both are nearly periodic phenomena. Subharmonic phonation adds cyclical variations to the normal glottal cycles, so estimating the subharmonic period requires a holistic analysis of the signal. Deep learning is an effective solution to this type of complex problem. This paper describes fully convolutional neural networks trained on synthesized subharmonic voice signals to classify the subharmonic period. Synthetic evaluation shows over 98% classification accuracy, and assessment of sustained vowel recordings demonstrates encouraging outcomes as well as areas for future improvement.
* 9 pages, 8 figures, submitted to IEEE Trans. Audio Speech Lang.
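A minimal PyTorch sketch of the kind of 1-D fully convolutional classifier the abstract describes, mapping a voice-signal frame to a subharmonic-period class; the layer sizes and the number of candidate periods are illustrative assumptions, not the authors' architecture.

```python
# Sketch only: a 1-D fully convolutional classifier for subharmonic periods.
# Layer widths, kernel sizes, and n_periods are assumptions for illustration.
import torch
import torch.nn as nn

class SubharmonicFCN(nn.Module):
    def __init__(self, n_periods: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, padding=4), nn.ReLU(),
        )
        # A 1x1 convolution acts as a per-position classifier; global average
        # pooling keeps the network fully convolutional (any input length).
        self.classifier = nn.Conv1d(128, n_periods, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) voice-signal frames
        h = self.features(x)
        logits = self.classifier(h)      # (batch, n_periods, samples)
        return logits.mean(dim=-1)       # pool over time -> (batch, n_periods)

model = SubharmonicFCN()
frames = torch.randn(8, 1, 2048)         # 8 synthetic frames
print(model(frames).shape)               # torch.Size([8, 4])
```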

Jan 08, 2025
Abstract:In clinical voice signal analysis, mishandling of subharmonic voicing may cause an acoustic parameter to signal false negatives. As such, the ability of a fundamental frequency estimator to identify speaking fundamental frequency is critical. This paper presents a sustained-vowel study, which used a quality-of-estimate classification to identify subharmonic errors and subharmonics-to-harmonics ratio (SHR) to measure the strength of subharmonic voicing. Five estimators were studied with a sustained vowel dataset: Praat, YAAPT, Harvest, CREPE, and FCN-F0. FCN-F0, a deep-learning model, performed the best both in overall accuracy and in correctly resolving subharmonic signals. CREPE and Harvest are also highly capable estimators for sustained vowel analysis.
* 9 pages, 6 figures
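A minimal numpy sketch of the quality-of-estimate labelling idea described above, assuming a reference f0 track is available: an estimate is tagged "subharmonic" when it lies near an integer fraction of the reference f0. The 10% tolerance and the divisors checked are illustrative assumptions.

```python
# Sketch only: label f0 estimates as correct, subharmonic, or gross errors.
import numpy as np

def label_estimates(f0_est, f0_ref, tol=0.10, max_divisor=4):
    f0_est, f0_ref = np.asarray(f0_est, float), np.asarray(f0_ref, float)
    labels = np.full(f0_est.shape, "gross_error", dtype=object)
    for k in range(1, max_divisor + 1):
        target = f0_ref / k                       # k = 1 is the true f0
        near = np.abs(f0_est - target) <= tol * target
        labels[near & (labels == "gross_error")] = "correct" if k == 1 else "subharmonic"
    return labels

f0_ref = np.array([200.0, 200.0, 200.0, 200.0])
f0_est = np.array([201.0, 99.5, 66.4, 150.0])     # correct, f0/2, f0/3, gross error
print(label_estimates(f0_est, f0_ref))
```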

Oct 25, 2024
Abstract:Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.
* Submitted to ICASSP 2025
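A minimal scikit-learn sketch of the comparison described above: probe tone classes from (a) continuous SSL latents and (b) their k-means symbol identities. Random vectors stand in for HuBERT-style frame latents here; in practice they would be extracted from a model such as HuBERT base. All sizes are illustrative assumptions.

```python
# Sketch only: continuous-latent probe vs. discrete-symbol probe.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768))           # placeholder frame-level latents
y = rng.integers(0, 4, size=2000)          # placeholder tone categories
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) linear probe on the continuous latents
cont = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# (b) discretize with k-means, then probe on one-hot symbol identities
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_tr)
enc = OneHotEncoder(handle_unknown="ignore")
S_tr = enc.fit_transform(km.predict(X_tr).reshape(-1, 1))
S_te = enc.transform(km.predict(X_te).reshape(-1, 1))
disc = LogisticRegression(max_iter=1000).fit(S_tr, y_tr)

print("continuous:", cont.score(X_te, y_te), "discrete:", disc.score(S_te, y_te))
```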

Oct 05, 2024
Abstract:This paper introduces a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks, specifically addressing a vowel-phoneme classification case study within the Automatic Speech Recognition research field. Through pseudo-Neural Architecture Search and Hyper-Parameters Tuning experiments conducted with an informed grid-search methodology, we achieve classification accuracy comparable to that of today's complex architectures (90.0 - 93.7%). Despite its simplicity, our model prioritizes generalization across language contexts and distributed applicability, supported by relevant statistical and performance metrics. The code for the experiments is openly available on our GitHub.
* Accepted at "2024 29th IEEE Symposium on Computers and Communications
(ISCC): workshop on Next-Generation Multimedia Services at the Edge:
Leveraging 5G and Beyond (NGMSE2024)". arXiv admin note: text overlap with
arXiv:2410.04098
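A minimal scikit-learn sketch of a One-Class-One-Network setup for vowel phonemes: one small binary network per vowel, with the final label taken from the network reporting the highest membership score. The feature type, network size, and vowel set are illustrative assumptions, not the paper's configuration.

```python
# Sketch only: one binary network per vowel, decision by highest score.
import numpy as np
from sklearn.neural_network import MLPClassifier

VOWELS = ["a", "e", "i", "o", "u"]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))              # placeholder feature vectors (e.g. MFCCs)
y = rng.integers(0, len(VOWELS), size=500)  # placeholder vowel labels

# One network per vowel: "this vowel" vs. "everything else".
nets = {}
for i, v in enumerate(VOWELS):
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    nets[v] = net.fit(X, (y == i).astype(int))

def predict(x):
    # Score each one-class network and keep the most confident one.
    scores = {v: net.predict_proba(x.reshape(1, -1))[0, 1] for v, net in nets.items()}
    return max(scores, key=scores.get)

print(predict(X[0]))
```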

Oct 05, 2024
Abstract:This paper explores a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks, focusing on vowel-phoneme classification and speaker recognition for the Automatic Speech Recognition (ASR) domain. In our case study, the ASR model runs on a proprietary sensing and lighting system used to monitor acoustic and air pollution on urban streets. We formalize combinations of pseudo-Neural Architecture Search and Hyper-Parameters Tuning experiments, using an informed grid-search methodology, to achieve classification accuracy comparable to that of today's most complex architectures, while also examining speaker recognition and energy efficiency. Despite its simplicity, our proposed model generalizes well across languages and speaker genders, making it suitable for widespread use in computationally constrained contexts, as supported by relevant statistical and performance metrics. The code for our experiments is openly accessible on our GitHub.
* in Proceedings of the 5th IEEE International Symposium on the
Internet of Sounds (IEEE IS2 2024, https://internetofsounds.net/is2_2024/)
* Accepted at "IEEE 5th International Symposium on the Internet of
Sounds, 30 Sep / 2 Oct 2024, Erlangen, Germany"

Aug 27, 2024
Abstract:The evolution and diversity of a language are evident from its various dialects. If these dialects are not addressed in technological advancements like automatic speech recognition and speech synthesis, there is a chance that they may disappear. Speech technology thus plays a role in preserving the various dialects of a language from going extinct. In order to build a full-fledged automatic speech recognition system that addresses various dialects, an Automatic Dialect Identification (ADI) system acting as the front end is required, much as language identification systems act as front ends to automatic speech recognition systems that handle multiple languages. The current work proposes a way to identify two popular and broadly classified Tamil dialects, namely literary and colloquial Tamil. Acoustic characteristics rather than phonetics and phonotactics are used, alleviating the requirement of language-dependent linguistic tools. Hence, one major advantage of the proposed method is that it does not require an annotated corpus and can be easily adapted to other languages. Gaussian Mixture Models (GMMs) using Mel Frequency Cepstral Coefficient (MFCC) features are used to perform the classification task. The experiments yielded an error rate of 12%. Vowel nasalization is discussed as the reason for this good performance. The number of mixture components in the GMM is varied and the performance is analysed.
* TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), Kochi,
India, 2019, pp. 1303-1306
* submitted to TENCON 2019
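A minimal scikit-learn sketch of the GMM/MFCC recipe described above: fit one Gaussian mixture per dialect on MFCC frames and label a test utterance by the higher average log-likelihood. Placeholder arrays stand in for real MFCCs (which would come from a feature extractor such as librosa.feature.mfcc); the mixture sizes are assumptions.

```python
# Sketch only: one GMM per dialect, classification by average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
mfcc_literary = rng.normal(0.0, 1.0, size=(4000, 13))     # placeholder training frames
mfcc_colloquial = rng.normal(0.5, 1.2, size=(4000, 13))

gmm_lit = GaussianMixture(n_components=32, covariance_type="diag",
                          random_state=0).fit(mfcc_literary)
gmm_col = GaussianMixture(n_components=32, covariance_type="diag",
                          random_state=0).fit(mfcc_colloquial)

def identify(mfcc_frames):
    # Average per-frame log-likelihood under each dialect model.
    return "literary" if gmm_lit.score(mfcc_frames) > gmm_col.score(mfcc_frames) else "colloquial"

test_utt = rng.normal(0.5, 1.2, size=(300, 13))            # placeholder held-out frames
print(identify(test_utt))
```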

Jun 10, 2024
Abstract:Chronic obstructive pulmonary disease (COPD) is a serious inflammatory lung disease affecting millions of people around the world. Due to an obstructed airflow from the lungs, it also becomes manifest in patients' vocal behaviour. Of particular importance is the detection of an exacerbation episode, which marks an acute phase and often requires hospitalisation and treatment. Previous work has shown that it is possible to distinguish between a pre- and a post-treatment state using automatic analysis of read speech. In this contribution, we examine whether sustained vowels can provide a complementary lens for telling apart these two states. Using a cohort of 50 patients, we show that the inclusion of sustained vowels can improve performance to up to 79% unweighted average recall, from a 71% baseline using read speech. We further identify and interpret the most important acoustic features that characterise the manifestation of COPD in sustained vowels.
* Accepted to INTERSPEECH 2024
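A minimal scikit-learn sketch of the evaluation metric quoted above: unweighted average recall (UAR) is the recall per class averaged without class weighting, i.e. macro-averaged recall over the two treatment states. The labels below are placeholders, not the study's data.

```python
# Sketch only: unweighted average recall (UAR) = macro-averaged recall.
from sklearn.metrics import recall_score

y_true = ["pre", "pre", "post", "post", "post", "pre"]   # placeholder ground truth
y_pred = ["pre", "post", "post", "post", "pre", "pre"]   # placeholder predictions

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.2f}")   # mean of per-class recalls, here (2/3 + 2/3) / 2
```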

Jun 20, 2024
Abstract:Voice disorders are pathologies that significantly affect patients' quality of life. However, non-invasive automated diagnosis of these pathologies is still under-explored, due both to a shortage of pathological voice data and to the diversity of the recording types used for diagnosis. This paper proposes a novel solution that adopts transformers working directly on raw voice signals and addresses data shortage through synthetic data generation and data augmentation. Further, we consider many recording types at the same time, such as sentence reading and sustained vowel emission, by employing a Mixture of Experts ensemble to align the predictions on different data types. The experimental results, obtained on both public and private datasets, show the effectiveness of our solution in the disorder detection and classification tasks, and that it largely improves over existing approaches.
* Accepted at Interspeech 2024
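A minimal numpy sketch of the general idea of combining per-recording-type experts, as an assumption about the approach rather than the paper's implementation: each expert scores a patient from its own recording type, and a gate weights the experts before the final decision. The probabilities and gate weights below are placeholders.

```python
# Sketch only: weighted combination of per-recording-type expert predictions.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Class probabilities from two experts for one patient (healthy, disorder A, disorder B).
p_sentence = np.array([0.20, 0.55, 0.25])   # expert trained on sentence reading
p_vowel = np.array([0.10, 0.70, 0.20])      # expert trained on sustained vowels

# Gate: per-expert reliability scores turned into mixture weights.
gate_logits = np.array([1.0, 1.5])          # placeholder learned weights
w = softmax(gate_logits)

p_final = w[0] * p_sentence + w[1] * p_vowel
print("mixture weights:", w, "final class:", p_final.argmax())
```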

Feb 27, 2024
Abstract:On-chip photonic processors for neural networks have potential benefits in both speed and energy efficiency but have not yet reached the scale at which they can outperform electronic processors. The dominant paradigm for designing on-chip photonics is to make networks of relatively bulky discrete components connected by one-dimensional waveguides. A far more compact alternative is to avoid explicitly defining any components and instead sculpt the continuous substrate of the photonic processor to directly perform the computation using waves freely propagating in two dimensions. We propose and demonstrate a device whose refractive index as a function of space, $n(x,z)$, can be rapidly reprogrammed, allowing arbitrary control over the wave propagation in the device. Our device, a 2D-programmable waveguide, combines photoconductive gain with the electro-optic effect to achieve massively parallel modulation of the refractive index of a slab waveguide, with an index modulation depth of $10^{-3}$ and approximately $10^4$ programmable degrees of freedom. We used a prototype device with a functional area of $12\,\text{mm}^2$ to perform neural-network inference with up to 49-dimensional input vectors in a single pass, achieving 96% accuracy on vowel classification and 86% accuracy on $7 \times 7$-pixel MNIST handwritten-digit classification. This is a scale beyond that of previous photonic chips relying on discrete components, illustrating the benefit of the continuous-waves paradigm. In principle, with large enough chip area, the reprogrammability of the device's refractive index distribution enables the reconfigurable realization of any passive, linear photonic circuit or device. This promises the development of more compact and versatile photonic systems for a wide range of applications, including optical processing, smart sensing, spectroscopy, and optical communications.

Dec 15, 2023
Abstract:Several sensing techniques have been proposed for silent speech recognition (SSR); however, many of these methods require invasive processes or sensor attachment to the skin using adhesive tape or glue, rendering them unsuitable for frequent use in daily life. By contrast, impulse radio ultra-wideband (IR-UWB) radar can operate without physical contact with users' articulators and related body parts, offering several advantages for SSR. These advantages include high range resolution, high penetrability, low power consumption, robustness to external light or sound interference, and the ability to be embedded in space-constrained handheld devices. This study demonstrated IR-UWB radar-based contactless SSR using four types of speech stimuli (vowels, consonants, words, and phrases). To achieve this, a novel speech feature extraction algorithm specifically designed for IR-UWB radar-based SSR is proposed. Each speech stimulus is recognized by applying a classification algorithm to the extracted speech features. Two different algorithms, multidimensional dynamic time warping (MD-DTW) and deep neural network-hidden Markov model (DNN-HMM), were compared for the classification task. Additionally, a favorable radar antenna position, either in front of the user's lips or below the user's chin, was determined to achieve higher recognition accuracy. Experimental results demonstrated the efficacy of the proposed speech feature extraction algorithm combined with DNN-HMM for classifying vowels, consonants, words, and phrases. Notably, this study represents the first demonstration of phoneme-level SSR using contactless radar.
* Submitted to IEEE Access
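A minimal numpy sketch of multidimensional dynamic time warping (MD-DTW), one of the two classifiers compared above: it aligns two multichannel feature sequences and returns the cumulative alignment cost, so a nearest-template classifier would pick the stimulus whose stored template gives the lowest cost. The feature contents below are placeholders, not radar data.

```python
# Sketch only: MD-DTW alignment cost between two (T, D) feature sequences.
import numpy as np

def md_dtw(a: np.ndarray, b: np.ndarray) -> float:
    """a: (T1, D), b: (T2, D); both sequences share the same feature dimension D."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])         # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],             # insertion
                                 cost[i, j - 1],             # deletion
                                 cost[i - 1, j - 1])         # match
    return float(cost[T1, T2])

rng = np.random.default_rng(0)
template = rng.normal(size=(40, 8))     # placeholder stored template for one stimulus
utterance = rng.normal(size=(55, 8))    # placeholder new feature sequence
print(md_dtw(template, utterance))
```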
