Ella Rabinovich

CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums

Aug 30, 2019
Ella Rabinovich, Masih Sultani, Suzanne Stevenson

In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far. We investigate whether findings in oral code-switching concerning content and style, as well as speaker proficiency, are carried over into written code-switching in discussion forums. The released dataset can further facilitate a range of research and practical activities.

* EMNLP 2019, 11 pages

Controversy in Context

Aug 20, 2019
Benjamin Sznajder, Ariel Gera, Yonatan Bilu, Dafna Sheinwald, Ella Rabinovich, Ranit Aharonov, David Konopnicki, Noam Slonim

With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.

* 5 pages

Learning Concept Abstractness Using Weak Supervision

Sep 05, 2018
Ella Rabinovich, Benjamin Sznajder, Artem Spector, Ilya Shnayderman, Ranit Aharonov, David Konopnicki, Noam Slonim

We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The results imply the applicability of this approach to additional properties of concepts, additional languages, and resource-scarce scenarios.

* 6 pages, EMNLP 2018

Native Language Cognate Effects on Second Language Lexical Choice

May 24, 2018
Ella Rabinovich, Yulia Tsvetkov, Shuly Wintner

We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages. We quantitatively analyze non-native lexical choice, highlighting cognate facilitation as one of the important phenomena shaping the language of non-native speakers.

* Transactions of the Association for Computational Linguistics (TACL), 2018; 14 pages

The UN Parallel Corpus Annotated for Translation Direction

May 20, 2018
Elad Tolochinsky, Ohad Mosafi, Ella Rabinovich, Shuly Wintner

This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as a classification problem, we achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language pairs annotated for translation direction, and then classify the data using various feature extraction methods. We compare the different methods as well as the ability to distinguish between translated and original texts in the different languages. The annotated corpus is publicly available.

Found in Translation: Reconstructing Phylogenetic Language Trees from Translations

Apr 24, 2017
Ella Rabinovich, Noam Ordan, Shuly Wintner

Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it is possible to uncover the history of the source language by looking only at the translation. Specifically, we automatically reconstruct phylogenetic language trees from monolingual texts (translated from several source languages). The signal of the source language is so powerful that it is retained even after two phases of translation. This strongly indicates that source language interference is the most dominant characteristic of translated texts, overshadowing the more subtle signals of universal properties of translation.

* ACL 2017, 11 pages

Personalized Machine Translation: Preserving Original Author Traits

Jan 12, 2017
Ella Rabinovich, Shachar Mirkin, Raj Nath Patel, Lucia Specia, Shuly Wintner

The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that the author's gender has a powerful, clear signal in original texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.

* EACL 2017, 11 pages

Unsupervised Identification of Translationese

Sep 11, 2016
Ella Rabinovich, Shuly Wintner

Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences have proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.

* TACL 2015, 14 pages

On the Similarities Between Native, Non-native and Translated Texts

Sep 11, 2016
Ella Rabinovich, Sergiu Nisioi, Noam Ordan, Shuly Wintner

We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable; (2) non-native language and translations are closer to each other than each of them is to native language; and (3) some of these characteristics depend on the source or native language, while others do not, reflecting, perhaps, unified principles that similarly affect translations and non-native language.

* ACL 2016, 12 pages

A Parallel Corpus of Translationese

Mar 06, 2016
Ella Rabinovich, Shuly Wintner, Ofek Luis Lewinsohn

We describe a set of bilingual English–French and English–German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks, and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has enjoyed growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
