"Text": models, code, and papers

Architecture and evolution of semantic networks in mathematics texts

Aug 14, 2019
Nicolas H. Christianson, Ann Sizemore Blevins, Danielle S. Bassett

Knowledge is a network of interconnected concepts. Yet, precisely how the topological structure of knowledge constrains its acquisition remains unknown, hampering the development of learning enhancement strategies. Here we study the topological structure of semantic networks reflecting mathematical concepts and their relations in college-level linear algebra texts. We hypothesize that these networks will exhibit structural order, reflecting the logical sequence of topics that ensures accessibility. We find that the networks exhibit strong core-periphery architecture, where a dense core of concepts presented early is complemented with a sparse periphery presented evenly throughout the exposition; the latter is composed of many small modules each reflecting more narrow domains. Using tools from applied topology, we find that the expositional evolution of the semantic networks produces and subsequently fills knowledge gaps, and that the density of these gaps tracks negatively with community ratings of each textbook. Broadly, our study lays the groundwork for future efforts developing optimal design principles for textbook exposition and teaching in a classroom setting.

* 17 pages, 5 figures 

Towards automatic identification of linguistic politeness in Hindi texts

Nov 30, 2021
Ritesh Kumar

In this paper I present a classifier for automatic identification of linguistic politeness in Hindi texts. I used a manually annotated corpus of over 25,000 blog comments to train an SVM. Drawing on the discursive and interactional approaches to politeness, the paper gives an exposition of the normative, conventionalised politeness structures of Hindi. Using these manually recognised structures as features in training the SVM significantly improves the performance of the classifier on the test set. The trained system achieves an accuracy of over 77%, which is within 2% of human accuracy.

* Proceedings of the 6th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 13), pp. 386 - 390, 2013 

AutoMATES: Automated Model Assembly from Text, Equations, and Software

Jan 21, 2020
Adarsh Pyarelal, Marco A. Valenzuela-Escarcega, Rebecca Sharp, Paul D. Hein, Jon Stephens, Pratik Bhandari, HeuiChan Lim, Saumya Debray, Clayton T. Morrison

Models of complicated systems can be represented in different ways - in scientific papers, they are represented using natural language text as well as equations. But to be of real use, they must also be implemented as software, thus making code a third form of representing models. We introduce the AutoMATES project, which aims to build semantically-rich unified representations of models from scientific code and publications to facilitate the integration of computational models from different domains and allow for modeling large, complicated systems that span multiple domains and levels of abstraction.

* 8 pages, 6 figures, accepted to Modeling the World's Systems 2019 

A Method for Open-Vocabulary Speech-Driven Text Retrieval

Jun 09, 2002
Atsushi Fujii, Katunobu Itou, Tetsuya Ishikawa

While recent retrieval techniques do not limit the number of index terms, out-of-vocabulary (OOV) words are crucial in speech recognition. Aiming at retrieving information with spoken queries, we fill the gap between speech recognition and text retrieval in terms of the vocabulary size. Given a spoken query, we generate a transcription and detect OOV words through speech recognition. We then map the detected OOV words to terms indexed in a target collection to complete the transcription, and search the collection for documents relevant to the completed transcription. Experiments show the effectiveness of our method.

* Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), pp.188-195, July. 2002 
* Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (To appear) 
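The transcription-completion step described in the abstract above can be sketched roughly as follows. This is only an illustration, not the authors' method: the function name `complete_transcription`, its parameters, and the use of approximate string matching from Python's `difflib` as a stand-in for matching OOV words against index terms are all assumptions introduced here.

```python
from difflib import get_close_matches


def complete_transcription(tokens, oov_positions, index_terms, cutoff=0.6):
    """Replace tokens flagged as OOV with the closest indexed term.

    Hypothetical helper: string similarity stands in for whatever
    matching the original system performs on the speech signal.
    """
    completed = list(tokens)
    for pos in oov_positions:
        # Ask for at most one candidate whose similarity is >= cutoff.
        matches = get_close_matches(tokens[pos], index_terms, n=1, cutoff=cutoff)
        if matches:
            completed[pos] = matches[0]
        # With no close candidate, the original token is left unchanged.
    return completed


query = ["films", "by", "kurosowa"]          # position 2 flagged as OOV
terms = ["kurosawa", "kubrick", "films"]     # terms indexed in the collection
print(complete_transcription(query, [2], terms))
```

The completed token list would then be passed to an ordinary text retrieval step, which is outside the scope of this sketch.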

Text Classification Models for Form Entity Linking

Dec 14, 2021
María Villota, César Domínguez, Jónathan Heras, Eloy Mata, Vico Pascual

Forms are a widespread type of template-based document used in a great variety of fields including, among others, administration, medicine, finance, or insurance. The automatic extraction of the information included in these documents is in great demand due to the increasing volume of forms that are generated on a daily basis. However, this is not a straightforward task when working with scanned forms because of the great diversity of templates with different locations of form entities, and the quality of the scanned documents. In this context, there is a feature that is shared by all forms: they contain a collection of interlinked entities built as key-value (or label-value) pairs, together with other entities such as headers or images. In this work, we have tackled the problem of entity linking in forms by combining image processing techniques and a text classification model based on the BERT architecture. This approach achieves state-of-the-art results with an F1-score of 0.80 on the FUNSD dataset, a 5% improvement over the best previous method. The code of this project is available at

Unsupervised Construction of Knowledge Graphs From Text and Code

Aug 25, 2019
Kun Cao, James Fairbanks

The scientific literature is a rich source of information for data mining with conceptual knowledge graphs; the open science movement has enriched this literature with complementary source code that implements scientific models. To exploit this new resource, we construct a knowledge graph using unsupervised learning methods to identify conceptual entities. We associate source code entities to these natural language concepts using word embedding and clustering techniques. Practical naming conventions for methods and functions tend to reflect the concept(s) they implement. We take advantage of this specificity by presenting a novel process for joint clustering text concepts that combines word-embeddings, nonlinear dimensionality reduction, and clustering techniques to assist in understanding, organizing, and comparing software in the open science ecosystem. With our pipeline, we aim to assist scientists in building on existing models in their discipline when making novel models for new phenomena. By combining source code and conceptual information, our knowledge graph enhances corpus-wide understanding of scientific literature.

* 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 15th International Workshop On Mining and Learning with Graphs 

An optimized system to solve text-based CAPTCHA

Jun 11, 2018
Ye Wang, Mi Lu

CAPTCHA (Completely Automated Public Turing test to Tell Computers and Humans Apart) can be used to protect data from automated bots. Countless kinds of CAPTCHAs have been designed, of which the text-based scheme is the most frequently used because it is the most convenient and user-friendly \cite{bursztein2011text}. Because CAPTCHA types differ widely, each requires its own segmentation method to isolate single characters. Our goal is to defeat the CAPTCHA, so the CAPTCHAs must first be split character by character. No general segmentation algorithm separates the characters in every kind of example, so segmentation has to be handled case by case. In this paper, we build a complete system to defeat these CAPTCHAs and achieve state-of-the-art performance. In detail, we present our self-adaptive algorithm to segment different kinds of characters optimally, and then utilize both existing methods and our own convolutional neural network as an extra classifier. Results show that our system works well at defeating these CAPTCHAs.

Categories of Emotion names in Web retrieved texts

Mar 11, 2012
Sergey Petrov, Jose F. Fontanari, Leonid I. Perlovsky

The categorization of emotion names, i.e., the grouping of emotion words that have similar emotional connotations together, is a key tool of Social Psychology used to explore people's knowledge about emotions. Without exception, the studies following that research line were based on the gauging of the perceived similarity between emotion names by the participants of the experiments. Here we propose and examine a new approach to study the categories of emotion names - the similarities between target emotion names are obtained by comparing the contexts in which they appear in texts retrieved from the World Wide Web. This comparison does not account for any explicit semantic information; it simply counts the number of common words or lexical items used in the contexts. This procedure allows us to write the entries of the similarity matrix as dot products in a linear vector space of contexts. The properties of this matrix were then explored using Multidimensional Scaling Analysis and Hierarchical Clustering. Our main findings, namely, the underlying dimension of the emotion space and the categories of emotion names, were consistent with those based on people's judgments of emotion names similarities.

* International Journal of Psychology and Behavioral Sciences 2 (2012) 173-184 
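As a rough illustration of the similarity computation described in the abstract above, the sketch below builds a context word-count vector for each target emotion name and fills the similarity matrix with dot products of those vectors. The function names, the whitespace tokenisation, and the fixed `window` size are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def context_vector(target, texts, window=2):
    """Count the words within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive tokenisation (assumption)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # Collect neighbouring tokens, skipping the target itself.
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def dot(u, v):
    """Dot product of two sparse count vectors over their shared words."""
    return sum(u[w] * v[w] for w in u.keys() & v.keys())


def similarity_matrix(targets, texts, window=2):
    """Entry (i, j) is the dot product of the context vectors of targets i and j."""
    vectors = [context_vector(t, texts, window) for t in targets]
    return [[dot(u, v) for v in vectors] for u in vectors]


texts = ["i feel joy and delight today", "i feel anger and rage today"]
print(similarity_matrix(["joy", "anger"], texts))
```

A matrix built this way is symmetric, so it could be handed to multidimensional scaling or hierarchical clustering routines; those steps are not shown here.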

Comparing Text Representations: A Theory-Driven Approach

Sep 15, 2021
Gregory Yauney, David Mimno

Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.

* Published in EMNLP 2021 

Variational Autoencoders for Semi-supervised Text Classification

Nov 24, 2016
Weidi Xu, Haoze Sun, Chao Deng, Ying Tan

Although the semi-supervised variational autoencoder (SemiVAE) works for image classification, it fails at text classification when a vanilla LSTM is used as its decoder. From a reinforcement learning perspective, it is verified that the decoder's capability to distinguish between different categorical labels is essential. Therefore, the Semi-supervised Sequential Variational Autoencoder (SSVAE) is proposed, which increases this capability by feeding the label into its decoder RNN at each time-step. Two specific decoder structures are investigated and both are verified to be effective. In addition, to reduce the computational complexity of training, a novel optimization method is proposed, which estimates the gradient of the unlabeled objective function by sampling, along with two variance reduction techniques. Experimental results on the Large Movie Review Dataset (IMDB) and the AG's News corpus show that the proposed approach significantly improves classification accuracy compared with purely supervised classifiers, and achieves competitive performance against previous advanced methods. State-of-the-art results can be obtained by integrating other pretraining-based methods.

* 8 pages, 4 figures 
