Abstract:Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception challenges in conditions like presbycusis. This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. Our approach leverages ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and subsequently analyzing phoneme-level confusion patterns. Key findings indicate that simulated hearing loss introduces specific phoneme confusions, predominantly affecting high-frequency consonants (e.g., alveolar/palatal to labiodental substitutions) and leading to significant phoneme deletions, consistent with the acoustic cues degraded in presbycusis. A test battery curated from these ASR-derived confusions demonstrated diagnostic value, effectively differentiating between simulated normal-hearing and hearing-impaired listeners in a comprehensive simulation. This ASR-driven methodology offers a promising avenue for developing objective, granular, and frequency-specific hearing assessment tools that complement traditional audiometry. Future work will focus on validating these findings with human participants and exploring the integration of advanced AI models for enhanced diagnostic precision.
Abstract:Deep learning algorithm are increasingly used for speech enhancement (SE). In supervised methods, global and local information is required for accurate spectral mapping. A key restriction is often poor capture of key contextual information. To leverage long-term for target speakers and compensate distortions of cleaned speech, this paper adopts a sequence-to-sequence (S2S) mapping structure and proposes a novel monaural speech enhancement system, consisting of a Feature Extraction Block (FEB), a Compensation Enhancement Block (ComEB) and a Mask Block (MB). In the FEB a U-net block is used to extract abstract features using complex-valued spectra with one path to suppress the background noise in the magnitude domain using masking methods and the MB takes magnitude features from the FEBand compensates the lost complex-domain features produced from ComEB to restore the final cleaned speech. Experiments are conducted on the Librispeech dataset and results show that the proposed model obtains better performance than recent models in terms of ESTOI and PESQ scores.