Title: Phonetic and Visual Priors for Decipherment of Informal Romanization

May 05, 2020

Maria Ryskina, Matthew R. Gormley, Taylor Berg-Kirkpatrick

Abstract: Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages: namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.
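To make the noisy-channel idea concrete, here is a minimal sketch of decoding under such a model. It is not the authors' WFST cascade and involves no unsupervised training: it assumes a hand-set character bigram language model over the original script and a hand-set one-to-one emission table into Latin characters, then picks the most probable original string with Viterbi search. The four-letter Cyrillic alphabet, all probabilities, and the names `LM`, `EMIT`, and `decode` are hypothetical and chosen only for illustration.

```python
import math

START = "<s>"

# Hypothetical bigram language model p(next | previous) over a toy Cyrillic alphabet.
LM = {
    START: {"ш": 0.4, "а": 0.2, "р": 0.2, "п": 0.2},
    "ш": {"а": 0.7, "ш": 0.1, "р": 0.1, "п": 0.1},
    "а": {"р": 0.6, "п": 0.1, "ш": 0.2, "а": 0.1},
    "р": {"ш": 0.25, "а": 0.25, "р": 0.25, "п": 0.25},
    "п": {"ш": 0.25, "а": 0.25, "р": 0.25, "п": 0.25},
}

# Hypothetical emission model p(latin | cyrillic), one Latin character per letter.
# "w" for "ш" and "p" for "р" stand in for visually motivated substitutions;
# "s" and "r" stand in for phonetically motivated ones.
EMIT = {
    "ш": {"w": 0.7, "s": 0.3},
    "а": {"a": 1.0},
    "р": {"r": 0.6, "p": 0.4},
    "п": {"p": 0.9, "n": 0.1},
}


def log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def decode(latin):
    """Return argmax over original strings o of p(o) * p(latin | o)."""
    if not latin:
        return ""
    states = list(EMIT)
    # best[s] = (log score, best decoded prefix ending in state s)
    best = {
        s: (log(LM[START].get(s, 0)) + log(EMIT[s].get(latin[0], 0)), s)
        for s in states
    }
    for ch in latin[1:]:
        new_best = {}
        for s in states:
            emit = log(EMIT[s].get(ch, 0))
            score, prefix = max(
                ((best[p][0] + log(LM[p].get(s, 0)) + emit, best[p][1])
                 for p in states),
                key=lambda t: t[0],
            )
            new_best[s] = (score, prefix + s)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```

With these toy numbers, `decode("wap")` returns `"шар"`: the final `p` could come from either `п` or `р`, and the language model's preference for `р` after `а` outweighs the higher emission probability of `п`. In the paper's setting the corresponding parameters are learned from romanized text rather than set by hand, with the phonetic and visual priors biasing the character mappings.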

* To appear at ACL 2020
