Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Sequence Modeling via Segmentations

Jul 18, 2018
Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, Li Deng

Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts.

* recurrent neural networks, dynamic programming, structured prediction 

  Access Paper or Ask Questions

Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages

Jul 01, 2018
Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Meza, Katharina Kann

Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we study translations from three low-resource, polysynthetic languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa. Doing so, we find that in a morpheme-to-morpheme alignment an important amount of information contained in polysynthetic morphemes has no Spanish counterpart, and its translation is often omitted. We further conduct a qualitative analysis and, thus, identify morpheme types that are commonly hard to align or ignored in the translation process.

* To appear in "All Together Now? Computational Modeling of Polysynthetic Languages" Workshop, at COLING 2018 

  Access Paper or Ask Questions

Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation

May 18, 2018
Thi-Vinh Ngo, Thanh-Le Ha, Phuong-Thai Nguyen, Le-Minh Nguyen

Neural machine translation (NMT) systems have recently obtained state-of-the art in many machine translation systems between popular language pairs because of the availability of data. For low-resourced language pairs, there are few researches in this field due to the lack of bilingual data. In this paper, we attempt to build the first NMT systems for a low-resourced language pairs:Japanese-Vietnamese. We have also shown significant improvements when combining advanced methods to reduce the adverse impacts of data sparsity and improve the quality of NMT systems. In addition, we proposed a variant of Byte-Pair Encoding algorithm to perform effective word segmentation for Vietnamese texts and alleviate the rare-word problem that persists in NMT systems.

  Access Paper or Ask Questions

Towards ECDSA key derivation from deep embeddings for novel Blockchain applications

Nov 11, 2017
Christian S. Perone

In this work, we propose a straightforward method to derive Elliptic Curve Digital Signature Algorithm (ECDSA) key pairs from embeddings created using Deep Learning and Metric Learning approaches. We also show that these keys allows the derivation of cryptocurrencies (such as Bitcoin) addresses that can be used to transfer and receive funds, allowing novel Blockchain-based applications that can be used to transfer funds or data directly to domains such as image, text, sound or any other domain where Deep Learning can extract high-quality embeddings; providing thus a novel integration between the properties of the Blockchain-based technologies such as trust minimization and decentralization together with the high-quality learned representations from Deep Learning techniques.

* 7 pages, 5 figures 

  Access Paper or Ask Questions

Method for Aspect-Based Sentiment Annotation Using Rhetorical Analysis

Sep 13, 2017
Łukasz Augustyniak, Krzysztof Rajda, Tomasz Kajdanowicz

This paper fills a gap in aspect-based sentiment analysis and aims to present a new method for preparing and analysing texts concerning opinion and generating user-friendly descriptive reports in natural language. We present a comprehensive set of techniques derived from Rhetorical Structure Theory and sentiment analysis to extract aspects from textual opinions and then build an abstractive summary of a set of opinions. Moreover, we propose aspect-aspect graphs to evaluate the importance of aspects and to filter out unimportant ones from the summary. Additionally, the paper presents a prototype solution of data flow with interesting and valuable results. The proposed method's results proved the high accuracy of aspect detection when applied to the gold standard dataset.

  Access Paper or Ask Questions

Metalearning for Feature Selection

Mar 20, 2017
Ben Goertzel, Nil Geisweiller, Chris Poulin

A general formulation of optimization problems in which various candidate solutions may use different feature-sets is presented, encompassing supervised classification, automated program learning and other cases. A novel characterization of the concept of a "good quality feature" for such an optimization problem is provided; and a proposal regarding the integration of quality based feature selection into metalearning is suggested, wherein the quality of a feature for a problem is estimated using knowledge about related features in the context of related problems. Results are presented regarding extensive testing of this "feature metalearning" approach on supervised text classification problems; it is demonstrated that, in this context, feature metalearning can provide significant and sometimes dramatic speedup over standard feature selection heuristics.

  Access Paper or Ask Questions

DeepStance at SemEval-2016 Task 6: Detecting Stance in Tweets Using Character and Word-Level CNNs

Jun 17, 2016
Prashanth Vijayaraghavan, Ivan Sysoev, Soroush Vosoughi, Deb Roy

This paper describes our approach for the Detecting Stance in Tweets task (SemEval-2016 Task 6). We utilized recent advances in short text categorization using deep learning to create word-level and character-level models. The choice between word-level and character-level models in each particular case was informed through validation performance. Our final system is a combination of classifiers using word-level or character-level models. We also employed novel data augmentation techniques to expand and diversify our training dataset, thus making our system more robust. Our system achieved a macro-average precision, recall and F1-scores of 0.67, 0.61 and 0.635 respectively.

* SemEval 2016, San Diego, California. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego, California 

  Access Paper or Ask Questions

Sequence Level Training with Recurrent Neural Networks

May 06, 2016
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba

Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image. However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way. We address this issue by proposing a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE. On three different tasks, our approach outperforms several strong baselines for greedy generation. The method is also competitive when these baselines employ beam search, while being several times faster.

  Access Paper or Ask Questions

Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2014

Sep 29, 2015
Krzysztof Wołk, Krzysztof Marasek

This research explores effects of various training settings between Polish and English Statistical Machine Translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system as well as Wikipedia based comparable corpora prepared by us. The BLEU, NIST, METEOR and TER metrics were used to evaluate the effects of data preparations on translation results. Our experiments included systems, which use lemma and morphological information on Polish words. We also conducted a deep analysis of provided Polish data as preparatory work for the automatic data correction and cleaning phase.

* Machine Translation, West slavic, Proceedings of the 11th International Workshop on Spoken Language Translation, Tahoe Lake, USA, 2014. arXiv admin note: text overlap with arXiv:1409.0473 by other authors 

  Access Paper or Ask Questions

Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach

Sep 24, 2014
Kathryn Baker, Michael Bloodgood, Chris Callison-Burch, Bonnie J. Dorr, Nathaniel W. Filardo, Lori Levin, Scott Miller, Christine Piatko

We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English translation task. This finding supports the hypothesis (posed by many researchers in the MT community, e.g., in DARPA GALE) that both syntactic and semantic information are critical for improving translation quality---and further demonstrates that large gains can be achieved for low-resource languages with different word order than English.

* In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado, October 2010 
* 10 pages, 7 figures, 3 tables; appeared in Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA), October 2010 

  Access Paper or Ask Questions