Mahta Fetrat Qharabagh

Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

May 19, 2025

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

Abstract: Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift: utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications like screen readers. To this end, we improve one of the most well-known rule-based G2P systems, eSpeak, into a fast homograph-aware version, HomoFast eSpeak. Our results show an approximate 30% improvement in homograph disambiguation accuracy for the deep learning-based and eSpeak systems.

* 8 main body pages, total 25 pages, 15 figures

LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study

Sep 13, 2024

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

Abstract: Grapheme-to-phoneme (G2P) conversion is critical in speech processing, particularly for applications like speech synthesis. G2P systems must possess linguistic understanding and contextual awareness of languages with polyphone words and context-dependent phonemes. Large language models (LLMs) have recently demonstrated significant potential in various language tasks, suggesting that their phonetic knowledge could be leveraged for G2P. In this paper, we evaluate the performance of LLMs in G2P conversion and introduce prompting and post-processing methods that enhance LLM outputs without additional training or labeled data. We also present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language. Our results show that by applying the proposed methods, LLMs can outperform traditional G2P tools, even in an underrepresented language like Persian, highlighting the potential of developing LLM-aided G2P systems.

* 5 pages, 5 figures

ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages

Sep 11, 2024

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

Abstract: In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.

* 33 pages, 12 figures
