Search is a key function of digital platforms and applications such as electronic dictionaries, search engines, and e-commerce platforms. While search is trivial to implement for some languages, Khmer word search is challenging because of the complex Khmer writing system. Multiple possible character orders and different spelling realizations of the same word constrain Khmer word search. Additionally, spelling mistakes are common because robust spellcheckers are not widely available across input device platforms. These challenges hinder the use of the Khmer language in search-embedded applications. Moreover, because no WordNet-like lexical database exists for the Khmer language, it is not possible to establish the semantic relations between words that semantic search requires. In this paper, we propose a set of robust solutions to these challenges. The proposed solutions include character order normalization, grapheme- and phoneme-based spellcheckers, and a Khmer word semantic model. The semantic model is based on a word embedding model trained on a 30-million-word corpus and is used to capture semantic similarities between words.
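As an illustration of how a word embedding model can stand in for a lexical database, the sketch below ranks words by cosine similarity between their vectors. It is a minimal example under stated assumptions, not the model described above: the three-dimensional vectors are made-up toy values, and the names `EMBEDDINGS`, `cosine`, and `most_similar` are hypothetical.

```python
import math

# Toy 3-dimensional vectors standing in for trained word embeddings.
# The values are invented for illustration only.
EMBEDDINGS = {
    "ឆ្មា": [0.9, 0.1, 0.0],    # "cat"
    "ឆ្កែ": [0.8, 0.2, 0.1],    # "dog"
    "សៀវភៅ": [0.0, 0.9, 0.4],  # "book"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, top_k=1):
    """Return the top_k (word, score) pairs closest to `word`, excluding itself."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


nearest = most_similar("ឆ្មា", EMBEDDINGS)
```

With these toy vectors, the nearest neighbour of "cat" is "dog" rather than "book", which is the kind of relation a semantic search would rely on.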
Text classification, the task of assigning labels to open-ended text, is one of the fundamental tasks in natural language processing and is useful for applications such as sentiment analysis. In this paper, we discuss various classification approaches for Khmer text, ranging from a classical TF-IDF algorithm with a support vector machine classifier to modern word embedding-based neural network classifiers, including a linear layer model, a recurrent neural network, and a convolutional neural network. A Khmer word embedding model is trained on a 30-million-word Khmer corpus to construct word vector representations, which are used to train the three neural network classifiers. We evaluate the performance of the different approaches on a news article dataset for both multi-class and multi-label text classification tasks. The results suggest that neural network classifiers using a word embedding model consistently outperform the traditional classifier using TF-IDF. The recurrent neural network classifier performs slightly better than the convolutional network and the linear layer network.
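For readers unfamiliar with the TF-IDF baseline, the following sketch computes one common variant of the weighting (relative term frequency multiplied by an unsmoothed logarithmic inverse document frequency). The exact variant used in the work above is not stated, so this formula is an assumption, as is the function name `tf_idf`. Documents are assumed to be already segmented into word tokens, since Khmer text does not mark word boundaries with spaces.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights.

    docs: list of documents, each a list of word tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Placeholder tokens; real input would be segmented Khmer words.
vectors = tf_idf([["a", "b"], ["a", "c"]])
```

A term that occurs in every document (here `"a"`) receives weight zero, while terms specific to one document receive positive weight. The resulting sparse vectors are what a classifier such as a support vector machine would take as input.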
This paper presents an end-to-end deep convolutional recurrent neural network solution for the Khmer optical character recognition (OCR) task. The proposed solution uses a sequence-to-sequence (Seq2Seq) architecture with an attention mechanism. The encoder extracts visual features from an input text-line image via layers of residual convolutional blocks and a layer of gated recurrent units (GRU). The features are encoded in a single context vector and a sequence of hidden states, which are fed to the decoder; the decoder emits one character at a time until a special end-of-sentence (EOS) token is reached. The attention mechanism allows the decoder network to adaptively select parts of the input image while predicting a target character. The Seq2Seq Khmer OCR network was trained on a large collection of computer-generated text-line images in seven common Khmer fonts. On a test set of 3000 images, the proposed model outperformed the state-of-the-art Tesseract OCR engine for the Khmer language, achieving a character error rate (CER) of 1% versus 3%.
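The character error rate reported above is conventionally defined as the edit (Levenshtein) distance between the recognized text and the reference text, divided by the reference length. The sketch below follows that conventional definition; whether the evaluation above counts Unicode code points or larger Khmer character clusters is not stated, so counting code points here is an assumption. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a four-character reference.
rate = cer("abcd", "abxd")
```

In this example `rate` is 0.25, since one of the four reference characters is wrong. Over a test set, the same ratio is usually computed from the summed edit distances and summed reference lengths.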
Khmer text is written from left to right with optional spaces. Spaces do not serve as word boundaries; instead, they are used for readability or other functional purposes. Word segmentation is a prerequisite step for downstream tasks such as part-of-speech (POS) tagging, and thus the robustness of POS tagging depends heavily on word segmentation. Conventional Khmer POS tagging is a two-stage process that begins with word segmentation, followed by the tagging of each word. In this work, we propose a joint word segmentation and POS tagging approach that uses a single deep learning model, so that word segmentation and POS tagging are performed simultaneously. The proposed model was trained and tested using the publicly available Khmer POS dataset. The validation suggests that the performance of the joint model is on par with conventional two-stage POS tagging.
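One way a single model can segment and tag at once is to predict a label for every character: the POS tag at a word-initial character and a special continuation label elsewhere. The sketch below decodes such a label sequence into (word, tag) pairs. This labeling scheme is an assumption made for illustration and is not confirmed as the scheme used in the work above; the continuation label `"NS"`, the tag names, and the function name `decode` are hypothetical.

```python
def decode(chars, labels, cont="NS"):
    """Turn per-character labels into a list of (word, tag) pairs.

    A label equal to `cont` means the character continues the current word;
    any other label starts a new word carrying that label as its POS tag.
    """
    words = []
    for ch, label in zip(chars, labels):
        if label == cont and words:
            text, tag = words[-1]
            words[-1] = (text + ch, tag)
        else:
            # Starts a new word. A leading continuation label, which a
            # well-formed prediction should not contain, also lands here.
            words.append((ch, label))
    return words


# Latin placeholders stand in for Khmer characters.
pairs = decode("abcde", ["NN", "NS", "VB", "NS", "NS"])
```

Here `pairs` is `[("ab", "NN"), ("cde", "VB")]`: the word boundaries and the tags fall out of the same label sequence, so no separate segmentation pass is needed.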