Abstract: Chinese discourse coherence modeling remains a challenging task in the field of Natural Language Processing. Existing approaches mostly rely on feature engineering, adopting sophisticated features to capture the logical, syntactic, or semantic relationships across sentences within a text. In this paper, we present an entity-driven recursive deep model for Chinese discourse coherence evaluation, based on a current neural network model for English discourse coherence. Specifically, to overcome that model's shortcoming in identifying entity (noun) overlap across sentences, our combined model incorporates entity information into the recursive neural network framework. Evaluation results on both the sentence ordering and the machine translation coherence rating tasks show the effectiveness of the proposed model, which significantly outperforms the existing strong baseline.
Abstract: Identifying different varieties of the same language is more challenging than identifying unrelated languages. In this paper, we propose an approach to discriminate among language varieties or dialects of Mandarin Chinese in Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore, a.k.a. the Greater China Region (GCR). For dialect identification in the GCR, we find that the commonly used character-level or word-level unigram features are not very efficient, because of specific problems such as the ambiguity and context dependence of words in the dialects of the GCR. To overcome these challenges, we use not only general features such as character-level n-grams, but also many new word-level features, including PMI-based and word alignment-based features. A series of evaluation results on both a news dataset and an open-domain dataset from Wikipedia show the effectiveness of the proposed approach.
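
To make the feature types named above concrete, the following is a minimal sketch of character-level n-gram extraction and one possible PMI-based word feature. The abstract does not specify how PMI is computed, so the formulation here (PMI between a word and a region label, estimated from document frequencies) is an assumption, as are the function names `char_ngrams` and `word_label_pmi` and the toy data.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_label_pmi(docs):
    """Compute PMI(word, label) = log(p(word, label) / (p(word) * p(label))).

    `docs` is a list of (tokens, label) pairs. Probabilities are estimated
    from document frequencies; this is an illustrative choice, not
    necessarily the formulation used in the paper.
    """
    n_docs = len(docs)
    word_df, label_df, joint_df = Counter(), Counter(), Counter()
    for tokens, label in docs:
        label_df[label] += 1
        for word in set(tokens):  # count each word once per document
            word_df[word] += 1
            joint_df[(word, label)] += 1
    return {
        (word, label): math.log(
            (count / n_docs)
            / ((word_df[word] / n_docs) * (label_df[label] / n_docs))
        )
        for (word, label), count in joint_df.items()
    }


# Toy corpus (hypothetical): "software" is commonly written 软件 in
# Mainland China and 軟體 in Taiwan, while 更新 ("update") is shared.
docs = [
    (["软件", "更新"], "mainland"),
    (["軟體", "更新"], "taiwan"),
]
pmi = word_label_pmi(docs)
bigrams = char_ngrams("软件更新", 2)
```

In this toy corpus the shared word 更新 receives a PMI of zero with either label, while the region-specific words receive a positive PMI with their own region, which is the kind of signal a word-level feature would need in order to separate closely related varieties.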