Mika Hämäläinen

From Plenipotentiary to Puddingless: Users and Uses of New Words in Early English Letters

Mar 17, 2021

Tanja Säily, Eetu Mäkelä, Mika Hämäläinen

Abstract: We study neologism use in two samples of early English correspondence, from 1640–1660 and 1760–1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while male letter-writers tend to use neologisms more frequently than women, the eighteenth century seems to have provided more opportunities for women and the lower ranks to participate in neologism use as well. In both samples, neologisms most frequently occur in letters written between close friends, which could be due to this less stable relationship triggering more creative language use. In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing, as correspondence is increasingly being used as a tool for building and maintaining social relationships in addition to exchanging information.

* In Multilingual Facilitation (2021)

Endangered Languages are not Low-Resourced!

Mar 17, 2021

Mika Hämäläinen

Abstract: The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.

* In Multilingual Facilitation (2021)

Speech Recognition for Endangered and Extinct Samoyedic languages

Dec 09, 2020

Niko Partanen, Mika Hämäläinen, Tiina Klooster

Abstract: Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. For the Kamas language we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments where Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic.

* the 34th Pacific Asia Conference on Language, Information and Computation

Normalization of Different Swedish Dialects Spoken in Finland

Dec 09, 2020

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar

Abstract: Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave the best results. We believe this is due to the size of the training data available for the model. Our models are accessible as a Python package. The study provides important information about the adaptability of these methods in different contexts, and gives important baselines for further study.

* In Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (GeoHumanities'20)
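
The word error rate quoted in the abstract above is a standard metric: the word-level edit distance between a reference and a system output, divided by the number of reference words. The following is a minimal sketch of that general definition, reported as a percentage. The function name `word_error_rate`, the whitespace tokenization, and the placeholder sentences are assumptions made for illustration; the listing does not say how the authors computed their figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate as a percentage of reference words.

    Illustrative helper (not from the paper): tokens are split on
    whitespace and compared exactly, with unit cost for substitutions,
    insertions and deletions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real dialect data.
    print(word_error_rate("a b c d", "a x c d"))
```

Because the denominator is the reference length, a hypothesis with many spurious insertions can score above 100.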

Ve'rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement

Dec 04, 2020

Khalid Alnajjar, Mika Hämäläinen, Jack Rueter, Niko Partanen

Abstract: We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking the technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.

* Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Nov 06, 2020

Quan Duong, Mika Hämäläinen, Simon Hengchen

Abstract: Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.

Automated Prediction of Medieval Arabic Diacritics

Oct 11, 2020

Khalid Alnajjar, Mika Hämäläinen, Niko Partanen, Jack Rueter

Abstract: This study uses a character-level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve on the online tool used as a baseline. A diacritization model has been published openly through an easy-to-use Python package available on PyPI and Zenodo. We have found that context size should be considered when optimizing a feasible prediction model.

Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity

Sep 06, 2020

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar, Jack Rueter, Thierry Poibeau

Abstract: We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal adaptation has on the perceived creativity of computer-generated poetry. Our results suggest that the more the dialect deviates from standard Finnish, the lower the scores people tend to give on an existing evaluation metric. However, on a word association test, people associate creativity and originality more with dialect and fluency more with standard Finnish.

* In proceedings of the Eleventh International Conference on Computational Creativity

Morphological Disambiguation of South Sámi with FSTs and Neural Networks

Apr 29, 2020

Mika Hämäläinen, Linda Wiechetek

Abstract: We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the contexts of any other endangered language as well.

* 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020)

FST Morphology for the Endangered Skolt Sami Language

Apr 09, 2020

Jack Rueter, Mika Hämäläinen

Abstract: We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.

* Accepted to The 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020)
