Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yu Tsao

Graduate Program of Data Science, National Taiwan University and Academia Sinica, Taipei, Taiwan, Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan

A Non-Intrusive Neural Quality Assessment Model for Surface Electromyography Signals

Feb 08, 2024

Cho-Yuan Lee, Kuan-Chen Wang, Kai-Chun Liu, Xugang Lu, Ping-Chen Yeh, Yu Tsao

Abstract:In practical scenarios involving the measurement of surface electromyography (sEMG) in muscles, particularly those areas near the heart, one of the primary sources of contamination is the presence of electrocardiogram (ECG) signals. To assess the quality of real-world sEMG data more effectively, this study proposes QASE-net, a new non-intrusive model that predicts the SNR of sEMG signals. QASE-net combines CNN-BLSTM with attention mechanisms and follows an end-to-end training strategy. Our experimental framework utilizes real-world sEMG and ECG data from two open-access databases, the Non-Invasive Adaptive Prosthetics Database and the MIT-BIH Normal Sinus Rhythm Database, respectively. The experimental results demonstrate the superiority of QASE-net over the previous assessment model, exhibiting significantly reduced prediction errors and notably higher linear correlations with the ground truth. These findings show the potential of QASE-net to substantially enhance the reliability and precision of sEMG quality assessment in practical applications.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

Feb 06, 2024

Yu-Tung Liu, Kuan-Chen Wang, Kai-Chun Liu, Sheng-Yu Peng, Yu Tsao

Figure 1 for SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

Figure 2 for SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

Figure 3 for SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

Figure 4 for SDEMG: Score-based Diffusion Model for Surface Electromyographic Signal Denoising

Abstract:Surface electromyography (sEMG) recordings can be influenced by electrocardiogram (ECG) signals when the muscle being monitored is close to the heart. Several existing methods use signal-processing-based approaches, such as high-pass filter and template subtraction, while some derive mapping functions to restore clean sEMG signals from noisy sEMG (sEMG with ECG interference). Recently, the score-based diffusion model, a renowned generative model, has been introduced to generate high-quality and accurate samples with noisy input data. In this study, we proposed a novel approach, termed SDEMG, as a score-based diffusion model for sEMG signal denoising. To evaluate the proposed SDEMG approach, we conduct experiments to reduce noise in sEMG signals, employing data from an openly accessible source, the Non-Invasive Adaptive Prosthetics database, along with ECG signals from the MIT-BIH Normal Sinus Rhythm Database. The experiment result indicates that SDEMG outperformed comparative methods and produced high-quality sEMG samples. The source code of SDEMG the framework is available at: https://github.com/tonyliu0910/SDEMG

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids

Jan 02, 2024

Dyah A. M. G. Wisnu, Epri Pratiwi, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, Yu Tsao

Figure 1 for HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids

Figure 2 for HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids

Figure 3 for HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids

Figure 4 for HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids

Abstract:This paper introduces HAAQI-Net, a non-intrusive deep learning model for music quality assessment tailored to hearing aid users. In contrast to traditional methods like the Hearing Aid Audio Quality Index (HAAQI), HAAQI-Net utilizes a Bidirectional Long Short-Term Memory (BLSTM) with attention. It takes an assessed music sample and a hearing loss pattern as input, generating a predicted HAAQI score. The model employs the pre-trained Bidirectional Encoder representation from Audio Transformers (BEATs) for acoustic feature extraction. Comparing predicted scores with ground truth, HAAQI-Net achieves a Longitudinal Concordance Correlation (LCC) of 0.9257, Spearman's Rank Correlation Coefficient (SRCC) of 0.9394, and Mean Squared Error (MSE) of 0.0080. Notably, this high performance comes with a substantial reduction in inference time: from 62.52 seconds (by HAAQI) to 2.71 seconds (by HAAQI-Net), serving as an efficient music quality assessment model for hearing aid users.

Via

Access Paper or Ask Questions

Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

Dec 14, 2023

Haibin Wu, Heng-Cheng Kuo, Yu Tsao, Hung-yi Lee

Figure 1 for Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

Figure 2 for Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

Figure 3 for Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

Figure 4 for Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification

Abstract:Automatic speaker verification (ASV) is highly susceptible to adversarial attacks. Purification modules are usually adopted as a pre-processing to mitigate adversarial noise. However, they are commonly implemented across diverse experimental settings, rendering direct comparisons challenging. This paper comprehensively compares mainstream purification techniques in a unified framework. We find these methods often face a trade-off between user experience and security, as they struggle to simultaneously maintain genuine sample performance and reduce adversarial perturbations. To address this challenge, some efforts have extended purification modules to encompass detection capabilities, aiming to alleviate the trade-off. However, advanced purification modules will always come into the stage to surpass previous detection method. As a result, we further propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection, achieving state-of-the-art (SOTA) performance in countering adversarial noise. Our ensemble method has great potential due to its compatibility with future advanced purification techniques.

* Submitted to 2024 ICASSP

Via

Access Paper or Ask Questions

D4AM: A General Denoising Framework for Downstream Acoustic Models

Nov 28, 2023

Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen

Abstract:The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method aims to consider the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with the direct feeding of noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems. Our code is available at https://github.com/ChangLee0903/D4AM.

Via

Access Paper or Ask Questions

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

Nov 28, 2023

Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, Yu Tsao

Abstract:The performance of speaker verification (SV) models may drop dramatically in noisy environments. A speech enhancement (SE) module can be used as a front-end strategy. However, existing SE methods may fail to bring performance improvements to downstream SV systems due to artifacts in the predicted signals of SE models. To compensate for artifacts, we propose a generic denoising framework named LC4SV, which can serve as a pre-processor for various unknown downstream SV models. In LC4SV, we employ a learning-based interpolation agent to automatically generate the appropriate coefficients between the enhanced signal and its noisy input to improve SV performance in noisy environments. Our experimental results demonstrate that LC4SV consistently improves the performance of various unseen SV systems. To the best of our knowledge, this work is the first attempt to develop a learning-based interpolation scheme aiming at improving SV performance in noisy environments.

Via

Access Paper or Ask Questions

Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Nov 27, 2023

Yi-Heng Lin, Wen-Hsuan Tseng, Li-Chin Chen, Ching-Ting Tan, Yu Tsao

Figure 1 for Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Figure 2 for Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Figure 3 for Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Figure 4 for Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Abstract:The Consensus Auditory-Perceptual Evaluation of Voice is a widely employed tool in clinical voice quality assessment that is significant for streaming communication among clinical professionals and benchmarking for the determination of further treatment. Currently, because the assessment relies on experienced clinicians, it tends to be inconsistent, and thus, difficult to standardize. To address this problem, we propose to leverage lightly weighted automatic audio parameter extraction, to increase the clinical relevance, reduce the complexity, and enhance the interpretability of voice quality assessment. The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing. A classical machine learning approach is employed. The result reveals that our approach performs similar to state-of-the-art (SOTA) methods, and outperforms the latent representation obtained by using popular audio pre-trained models. This approach provide insights into the feasibility of different feature extraction approaches for voice evaluation. Audio parameters such as jitter and the HNR are proven to be suitable for characterizing voice quality attributes, such as roughness and strain. Conversely, pre-trained models exhibit limitations in effectively addressing noise-related scorings. This study contributes toward more comprehensive and precise voice quality evaluations, achieved by a comprehensively exploring diverse assessment methodologies.

* Published in IEEE 42th International Conference on Consumer Electronics (ICCE 2024)

Via

Access Paper or Ask Questions

Multi-objective Non-intrusive Hearing-aid Speech Assessment Model

Nov 15, 2023

Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen

Abstract:Without the need for a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to OOD dataset. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability.

Via

Access Paper or Ask Questions

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Nov 05, 2023

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

Figure 1 for AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Figure 2 for AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Figure 3 for AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Figure 4 for AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Abstract:Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multi-modal models that can exploit both pieces of information simultaneously. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multi-modal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.

Via

Access Paper or Ask Questions

Neural domain alignment for spoken language recognition based on optimal transport

Oct 20, 2023

Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

Figure 1 for Neural domain alignment for spoken language recognition based on optimal transport

Figure 2 for Neural domain alignment for spoken language recognition based on optimal transport

Figure 3 for Neural domain alignment for spoken language recognition based on optimal transport

Figure 4 for Neural domain alignment for spoken language recognition based on optimal transport

Abstract:Domain shift poses a significant challenge in cross-domain spoken language recognition (SLR) by reducing its effectiveness. Unsupervised domain adaptation (UDA) algorithms have been explored to address domain shifts in SLR without relying on class labels in the target domain. One successful UDA approach focuses on learning domain-invariant representations to align feature distributions between domains. However, disregarding the class structure during the learning process of domain-invariant representations can result in over-alignment, negatively impacting the classification task. To overcome this limitation, we propose an optimal transport (OT)-based UDA algorithm for a cross-domain SLR, leveraging the distribution geometry structure-aware property of OT. An OT-based discrepancy measure on a joint distribution over feature and label information is considered during domain alignment in OT-based UDA. Our previous study discovered that completely aligning the distributions between the source and target domains can introduce a negative transfer, where classes or irrelevant classes from the source domain map to a different class in the target domain during distribution alignment. This negative transfer degrades the performance of the adaptive model. To mitigate this issue, we introduce coupling-weighted partial optimal transport (POT) within our UDA framework for SLR, where soft weighting on the OT coupling based on transport cost is adaptively set during domain alignment. A cross-domain SLR task was used in the experiments to evaluate the proposed UDA. The results demonstrated that our proposed UDA algorithm significantly improved the performance over existing UDA algorithms in a cross-channel SLR task.

Via

Access Paper or Ask Questions