Gaoze Hou

Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

May 08, 2025

Linrong Pan, Chenglong Jiang, Gaoze Hou, Ying Gao

Abstract: This paper reports the construction of Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to advance research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.
