"Text": models, code, and papers

Leveraging Unstructured Data to Detect Emerging Reliability Issues

Jul 26, 2016
Deovrat Kakde, Arin Chaudhuri

Unstructured data refers to information that does not have a predefined data model or is not organized in a pre-defined manner. Loosely speaking, unstructured data refers to text data that is generated by humans. In after-sales service businesses, there are two main sources of unstructured data: customer complaints, which generally describe symptoms, and technician comments, which outline diagnostics and treatment information. A legitimate customer complaint can eventually be tracked to a failure or a claim. However, there is a delay between the time of a customer complaint and the time of a failure or a claim. A proactive strategy aimed at analyzing customer complaints for symptoms can help service providers detect reliability problems in advance and initiate corrective actions such as recalls. This paper introduces essential text mining concepts in the context of reliability analysis and a method to detect emerging reliability issues. The application of the method is illustrated using a case study.

Automatic Classification of Argument Components in Argumentative Essay Text Documents (Klasifikasi Komponen Argumen Secara Otomatis pada Dokumen Teks berbentuk Esai Argumentatif)

Dec 02, 2015
Derwin Suhartono

By automatically recognizing argument components, essay writers can inspect the texts they have written. It also supports objective and precise essay scoring, because the grader can see how well the argument components are constructed. Some researchers have worked on argument detection and classification and applied them in several domains. The common approach is to extract features from the text; generally, the features are structural, lexical, syntactic, indicator, and contextual. In this research, we add a new feature to the existing ones: a keyword list adopted from Knott and Dale (1993). The experimental results show that argument classification achieves 72.45% accuracy. However, we obtain the same accuracy without the keyword list, which indicates that the keyword list does not contribute significantly beyond the existing features. All features are still weak at distinguishing major claims from claims, so other features are needed to differentiate these two kinds of argument components.

* 16 pages, 3 figures, 2 tables, Technical Report Program Studi Doktor Ilmu Komputer Universitas Indonesia 

Document Embedding with Paragraph Vectors

Jul 29, 2015
Andrew M. Dai, Christopher Olah, Quoc V. Le

Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of text. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can produce semantically meaningful results.

* 8 pages 

Associative Measures and Multi-word Unit Extraction in Turkish

Jul 15, 2015
Umit Mersinli

Associative measures are "mathematical formulas determining the strength of association between two or more words based on their occurrences and cooccurrences in a text corpus" (Pecina, 2010, p. 138). The purpose of this paper is to test the 12 associative measures that Text-NSP (Banerjee & Pedersen, 2003) contains on a 10-million-word subcorpus of the Turkish National Corpus (TNC) (Aksan, 2012). A statistical comparison of those measures is out of the scope of the study, and the measures will be evaluated according to the linguistic relevance of the rankings they provide. The focus of the study is on optimizing the corpus data before applying the measures and then evaluating the rankings produced by these measures as a whole, not on the linguistic relevance of individual n-grams. The findings include intra-linguistically relevant associative measures for a comma-delimited, sentence-split, lower-cased, well-balanced, representative, 10-million-word corpus of Turkish.

* Associative Measures and Multi-word Unit Extraction in Turkish. Dil ve Edebiyat Dergisi. 12(1). 43-61 
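
To make the idea of an associative measure concrete, the following is a minimal sketch that ranks adjacent word pairs by pointwise mutual information (PMI), one commonly used association measure. It is not the Text-NSP implementation and not the procedure used in the paper above; the function names (`pmi_scores`, `rank_bigrams`), the `min_count` threshold, and the toy English sentence are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Score adjacent word pairs by pointwise mutual information (in bits).

    Illustrative only: probabilities are simple relative frequencies
    estimated from the token list itself.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs; PMI is unstable for low counts
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def rank_bigrams(tokens, min_count=1):
    """Return (bigram, score) pairs, strongest association first."""
    scores = pmi_scores(tokens, min_count)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus, already lower-cased and tokenized.
toy = "new york is big and new york is old and the city is big".split()
ranking = rank_bigrams(toy, min_count=2)
for bigram, score in ranking:
    print(bigram, round(score, 3))
```

In this toy run the pair ("new", "york") ranks first because its two words rarely occur apart, which is the kind of ranking the study evaluates for linguistic relevance. The preprocessing choices (lower-casing, sentence splitting) change the counts and therefore the rankings.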

Learning New Facts From Knowledge Bases With Neural Tensor Networks and Semantic Word Vectors

Mar 16, 2013
Danqi Chen, Richard Socher, Christopher D. Manning, Andrew Y. Ng

Knowledge bases provide applications with the benefit of easily accessible, systematic relational knowledge but often suffer in practice from their incompleteness and lack of knowledge of new entities and relations. Much work has focused on building or extending them by finding patterns in large unannotated text corpora. In contrast, here we mainly aim to complete a knowledge base by predicting additional true relationships between entities, based on generalizations that can be discerned in the given knowledgebase. We introduce a neural tensor network (NTN) model which predicts new relationship entries that can be added to the database. This model can be improved by initializing entity representations with word vectors learned in an unsupervised fashion from text, and when doing this, existing relations can even be queried for entities that were not present in the database. Our model generalizes and outperforms existing models for this problem, and can classify unseen relationships in WordNet with an accuracy of 75.8%.
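
The abstract does not spell out how the model scores a candidate relationship. As a rough sketch, the commonly cited form of the neural tensor network score for a triple (e1, R, e2) is u^T tanh(e1^T W[1:k] e2 + V [e1; e2] + b), where W, V, b and u are parameters specific to the relation R. The pure-Python code below computes that expression; the function name `ntn_score`, the nested-list parameter layout, and the tiny hand-picked values are assumptions for illustration, not the authors' implementation or trained weights.

```python
import math


def ntn_score(e1, e2, W, V, b, u):
    """Score a triple (e1, R, e2) given relation-specific parameters.

    e1, e2: entity vectors of length d (lists of floats)
    W: k slices, each a d x d matrix (bilinear tensor)
    V: k rows, each of length 2d (standard linear layer)
    b: k biases, u: k output weights
    """
    pair = e1 + e2  # list concatenation, i.e. the stacked vector [e1; e2]
    hidden = []
    for k in range(len(W)):
        bilinear = sum(
            e1[i] * W[k][i][j] * e2[j]
            for i in range(len(e1))
            for j in range(len(e2))
        )
        linear = sum(V[k][m] * pair[m] for m in range(len(pair)))
        hidden.append(math.tanh(bilinear + linear + b[k]))
    return sum(u[k] * hidden[k] for k in range(len(W)))


# Hypothetical parameters: d = 2 dimensions, k = 1 tensor slice.
e1 = [1.0, 0.0]
e2 = [0.0, 1.0]
W = [[[0.0, 1.0], [0.0, 0.0]]]  # only the (e1[0], e2[1]) interaction is active
V = [[0.0, 0.0, 0.0, 0.0]]
b = [0.0]
u = [1.0]
score = ntn_score(e1, e2, W, V, b, u)
print(score)
```

A higher score is read as the relationship being more plausible; in practice the entity vectors would be initialized from word vectors, as the abstract describes, and all parameters would be learned.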

Authorship Identification in Bengali Literature: a Comparative Analysis

Feb 24, 2013
Tanmoy Chakraborty

Stylometry is the study of the unique linguistic styles and writing behaviors of individuals. It underlies core text categorization tasks such as authorship identification and plagiarism detection. Though a reasonable number of studies have been conducted for English, no major work has been done so far for Bengali. In this work, we present a demonstration of authorship identification for documents written in Bengali. We adopt a set of fine-grained stylistic features for the analysis of the text and use them to develop two different models: a statistical similarity model consisting of three measures and their combination, and a machine learning model with Decision Tree, Neural Network and SVM. Experimental results show that SVM outperforms other state-of-the-art methods under 10-fold cross-validation. We also validate the relative importance of each stylistic feature to show that some of them remain consistently significant in every model used in this experiment.

* Chakraborty, T., Authorship Identification in Bengali Literature: a Comparative Analysis, Proceedings of COLING 2012: Demonstration Papers, December, 2012, pp. 41-50 
* 9 pages, 5 tables, 4 pictures 

Directed Replacement

Jun 23, 1996
Lauri Karttunen

This paper introduces to the finite-state calculus a family of directed replace operators. In contrast to the simple replace expression, UPPER -> LOWER, defined in Karttunen (ACL-95), the new directed version, UPPER @-> LOWER, yields an unambiguous transducer if the lower language consists of a single string. It transduces the input string from left to right, making only the longest possible replacement at each point. A new type of replacement expression, UPPER @-> PREFIX ... SUFFIX, yields a transducer that inserts text around strings that are instances of UPPER. The symbol ... denotes the matching part of the input which itself remains unchanged. PREFIX and SUFFIX are regular expressions describing the insertions. Expressions of the type UPPER @-> PREFIX ... SUFFIX may be used to compose a deterministic parser for a "local grammar" in the sense of Gross (1989). Other useful applications of directed replacement include tokenization and filtering of text streams.

* To appear in the Proceedings of ACL-96 
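
The left-to-right, longest-match behaviour of UPPER @-> PREFIX ... SUFFIX can be imitated procedurally. The sketch below is only a simulation under strong assumptions: UPPER is restricted to a finite set of literal strings, and PREFIX and SUFFIX are fixed strings, whereas the paper defines the operator over regular languages and compiles it into a finite-state transducer. The function name `directed_replace` is hypothetical.

```python
def directed_replace(text, upper, prefix, suffix):
    """Wrap each left-to-right, longest match of `upper` in prefix/suffix.

    `upper` is a finite collection of literal strings (an assumption of this
    sketch). The matched text itself is left unchanged, as with "...".
    """
    # Try longer candidates first so that the longest match wins at each point.
    patterns = sorted((u for u in upper if u), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        match = next((u for u in patterns if text.startswith(u, i)), None)
        if match is None:
            out.append(text[i])  # no match starts here: copy one symbol
            i += 1
        else:
            out.append(prefix + match + suffix)
            i += len(match)  # resume after the match, never inside it
    return "".join(out)


print(directed_replace("xabcb", {"ab", "abc", "b"}, "[", "]"))  # x[abc][b]
print(directed_replace("abab", {"ab", "aba"}, "[", "]"))        # [aba]b
```

The second call shows why the result is unambiguous: once the longest match "aba" is taken at position 0, the scan continues after it, so the overlapping candidate "ab" at position 2 is never considered.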

CINO: A Chinese Minority Pre-trained Language Model

Feb 28, 2022
Ziqing Yang, Zihang Xu, Yiming Cui, Baoxin Wang, Min Lin, Dayong Wu, Zhigang Chen

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. They greatly facilitate the application of natural language processing to low-resource languages. However, there are still some languages that the existing multilingual models do not perform well on. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Cantonese, and six other Chinese minority languages. To evaluate the cross-lingual ability of the multilingual models on the minority languages, we collect documents from Wikipedia and build a text classification dataset WCM (Wiki-Chinese-Minority). We test CINO on WCM and two other text classification tasks. Experiments show that CINO outperforms the baselines notably. The CINO model and the WCM dataset are available at

* 4 pages 

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

Oct 09, 2021
Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li

In this paper, we formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO). Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech, but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video. We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization. The proposed VisualTTS adopts two novel mechanisms that are 1) textual-visual attention, and 2) visual fusion strategy during acoustic decoding, which both contribute to forming accurate alignment between the input text content and lip motion in input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and outperforms all baseline systems.

* Submitted to ICASSP 2022 

WebQA: Multihop and Multimodal QA

Sep 21, 2021
Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, Yonatan Bisk

Web search is fundamentally multimodal and multihop. Often, even before asking a question we choose to go directly to image search to find our answers. Further, rarely do we find an answer from a single source but aggregate information and reason through implications. Despite the frequency of this everyday occurrence, at present, there is no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources -- akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that A. our multihop text queries are difficult for a large-scale transformer model, and B. existing multi-modal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
