Karen Avetisyan

A Simple and Effective Method of Cross-Lingual Plagiarism Detection

Apr 05, 2023

Karen Avetisyan, Arthur Malajyan, Tsolak Ghukasyan, Arutyun Avetisyan

Abstract: We present a simple cross-lingual plagiarism detection method applicable to a large number of languages. The approach leverages open multilingual thesauri for the candidate retrieval task and pre-trained multilingual BERT-based language models for detailed analysis. The method does not rely on machine translation or word sense disambiguation when in use and is therefore suitable for a large number of languages, including under-resourced ones. The effectiveness of the proposed approach is demonstrated on several existing and new benchmarks, achieving state-of-the-art results for French, Russian, and Armenian.

ARPA: Armenian Paraphrase Detection Corpus and Models

Sep 26, 2020

Arthur Malajyan, Karen Avetisyan, Tsolak Ghukasyan

Abstract: In this work, we employ a semi-automatic method based on back translation to generate a sentential paraphrase corpus for the Armenian language. The initial collection of sentences is translated from Armenian to English and back twice, resulting in pairs of lexically distant but semantically similar sentences. The generated paraphrases are then manually reviewed and annotated. Using this method, train and test datasets are created, containing 2360 paraphrases in total. In addition, the datasets are used to train and evaluate BERT-based models for detecting paraphrases in Armenian, achieving results comparable to the state of the art for other languages.

* To be published in the proceedings of Ivannikov Memorial Workshop 2020

Word Embeddings for the Armenian Language: Intrinsic and Extrinsic Evaluation

Jun 07, 2019

Karen Avetisyan, Tsolak Ghukasyan

Abstract: In this work, we intrinsically and extrinsically evaluate and compare existing word embedding models for the Armenian language. In addition, we present new embeddings trained using the GloVe, fastText, CBOW, and SkipGram algorithms. We adapt the word analogy task and use it for intrinsic evaluation of the embeddings. For extrinsic evaluation, two tasks are employed: morphological tagging and text classification. Tagging is performed with a deep neural network, using the ArmTDP v2.3 dataset. For text classification, we propose a corpus of news articles categorized into 7 classes. The datasets are made public to serve as benchmarks for future models.
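
For readers unfamiliar with the word analogy task mentioned in this abstract, the following is a minimal sketch of how such an intrinsic evaluation is commonly scored. It assumes the widely used vector-offset rule (pick the word closest to b - a + c by cosine similarity); the abstract does not say which scoring rule the authors used. The function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset rule.

    Returns the vocabulary word closest to (b - a + c), excluding a, b, c.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Share of (a, b, c, expected) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped (an assumption).
    """
    covered = [q for q in questions if all(w in emb for w in q)]
    if not covered:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered)


# Toy vectors, made up for illustration; real embeddings have 50+ dimensions.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}

questions = [
    ("man", "woman", "king", "queen"),
    ("man", "king", "woman", "queen"),
    ("man", "woman", "prince", "princess"),  # skipped: words not in toy vocabulary
]

print(solve_analogy(toy, "man", "woman", "king"))
print(analogy_accuracy(toy, questions))
```

Skipping out-of-vocabulary questions keeps the score comparable only among models with similar coverage; counting them as errors is an equally reasonable choice when comparing models with different vocabularies.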

pioNER: Datasets and Baselines for Armenian Named Entity Recognition

Oct 19, 2018

Tsolak Ghukasyan, Garnik Davtyan, Karen Avetisyan, Ivan Andrianov

Abstract: In this work, we tackle the problem of Armenian named entity recognition, providing silver- and gold-standard datasets and establishing baseline results with popular models. We present a 163000-token named entity corpus automatically generated and annotated from Wikipedia, and another 53400-token corpus of news sentences with manual annotation of person, organization, and location named entities. The corpora were used to train and evaluate several popular named entity recognition models. Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and encyclopedia.

* Accepted paper at Ivannikov ISP RAS Open Conference 2018. © 2018 IEEE
