"Text": models, code, and papers

Deep Lexical Hypothesis: Identifying personality structure in natural language

Mar 04, 2022
Andrew Cutler, David M. Condon

Recent advances in natural language processing (NLP) have produced general models that can perform complex tasks such as summarizing long passages and translating across languages. Here, we introduce a method to extract adjective similarities from language models as done with survey-based ratings in traditional psycholexical studies but using millions of times more text in a natural setting. The correlational structure produced through this method is highly similar to that of self- and other-ratings of 435 terms reported by Saucier and Goldberg (1996a). The first three unrotated factors produced using NLP are congruent with those in survey data, with coefficients of 0.89, 0.79, and 0.79. This structure is robust to many modeling decisions: adjective set, including those with 1,710 terms (Goldberg, 1982) and 18,000 terms (Allport & Odbert, 1936); the query used to extract correlations; and language model. Notably, Neuroticism and Openness are only weakly and inconsistently recovered. This is a new source of signal that is closer to the original (semantic) vision of the Lexical Hypothesis. The method can be applied where surveys cannot: in dozens of languages simultaneously, with tens of thousands of items, on historical text, and at extremely large scale for little cost. The code is made public to facilitate reproduction and fast iteration in new directions of research.
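The abstract above reports factor congruence as coefficients (0.89, 0.79, 0.79) but does not say how they are computed. A common choice for comparing factor loadings is Tucker's congruence coefficient, and the sketch below assumes that measure. The function name and the loading values are hypothetical and are not taken from the paper's code.

```python
import math


def congruence_coefficient(x, y):
    """Tucker's congruence coefficient between two loading vectors.

    Assumption: x and y hold the loadings of the same adjectives, in the
    same order, on one factor from each source (e.g. NLP vs. survey).
    """
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0:
        raise ValueError("loading vectors must not be all zeros")
    return numerator / denominator


# Hypothetical loadings for four adjectives on one factor.
nlp_loadings = [0.62, 0.48, -0.35, 0.10]
survey_loadings = [0.58, 0.51, -0.20, 0.05]
phi = congruence_coefficient(nlp_loadings, survey_loadings)
print(f"congruence: {phi:.2f}")
```

Unlike a Pearson correlation, this coefficient does not center the loadings, so it is sensitive to their overall sign and level but not to their scale.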

Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections

Nov 01, 2021
Wei-Jen Ko, Cutter Dalton, Mark Simmons, Eliza Fisher, Greg Durrett, Junyi Jessy Li

While there has been substantial progress in text comprehension through simple factoid question answering, more holistic comprehension of a discourse still presents a major challenge. Someone critically reflecting on a text as they read it will pose curiosity-driven, often open-ended questions, which reflect deep understanding of the content and require complex reasoning to answer. A key challenge in building and evaluating models for this type of discourse comprehension is the lack of annotated data, especially since finding answers to such questions (which may not be answered at all) requires high cognitive load for annotators over long documents. This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents, viewing these questions through the lens of discourse. The resulting corpus, DCQA (Discourse Comprehension by Question Answering), consists of 22,430 question-answer pairs across 607 English documents. DCQA captures both discourse and semantic links between sentences in the form of free-form, open-ended questions. On an evaluation set that we annotated on questions from the INQUISITIVE dataset, we show that DCQA provides valuable supervision for answering open-ended questions. We additionally design pre-training methods utilizing existing question-answering resources, and use synthetic data to accommodate unanswerable questions.

Identity Inference on Blockchain using Graph Neural Network

Apr 14, 2021
Jie Shen, Jiajun Zhou, Yunyi Xie, Shanqing Yu, Qi Xuan

The anonymity of blockchain has accelerated the growth of illegal activities and criminal behaviors on cryptocurrency platforms. Although decentralization is one of the typical characteristics of blockchain, we urgently call for effective regulation to detect these illegal behaviors and ensure the safety and stability of user transactions. Identity inference, which aims to make a preliminary inference about account identity, plays a significant role in blockchain security. As a common tool, graph mining can effectively represent the interactive information between accounts and be used for identity inference. However, existing methods cannot balance scalability and end-to-end architecture, resulting in high computational consumption and weak feature representation. In this paper, we present a novel approach that analyzes user behavior from the perspective of the transaction subgraph, which naturally transforms the identity inference task into a graph classification problem and effectively avoids computation on the large-scale graph. Furthermore, we propose a generic end-to-end graph neural network model, named $\text{I}^2 \text{BGNN}$, which accepts a subgraph as input and learns a function mapping the transaction subgraph pattern to account identity, achieving de-anonymization. Extensive experiments on the EOSG and ETHG datasets demonstrate that the proposed method achieves state-of-the-art performance in identity inference.

* Under review 

TypeNet: Deep Learning Keystroke Biometrics

Feb 18, 2021
Alejandro Acien, Aythami Morales, John V. Monaco, Ruben Vera-Rodriguez, Julian Fierrez

We study the performance of Long Short-Term Memory networks for keystroke biometric authentication at large scale in free-text scenarios. For this we introduce TypeNet, a Recurrent Neural Network (RNN) trained with a moderate number of keystrokes per identity. We evaluate different learning approaches depending on the loss function (softmax, contrastive, and triplet loss), number of gallery samples, length of the keystroke sequences, and device type (physical vs. touchscreen keyboard). With 5 gallery sequences and test sequences of length 50, TypeNet achieves state-of-the-art keystroke biometric authentication performance with an Equal Error Rate of 2.2% and 9.2% for physical and touchscreen keyboards, respectively, significantly outperforming previous approaches. Our experiments show only a moderate increase in error with up to 100,000 subjects, demonstrating the potential of TypeNet to operate at Internet scale. We utilize two Aalto University keystroke databases, one captured on physical keyboards and the second on mobile devices (touchscreen keyboards). To the best of our knowledge, these are the largest free-text keystroke databases available for research, with more than 136 million keystrokes from 168,000 subjects on physical keyboards and more than 63 million keystrokes from 60,000 subjects on mobile touchscreens.

* arXiv admin note: substantial text overlap with arXiv:2004.03627 
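The TypeNet abstract reports results as an Equal Error Rate (EER), the operating point where the false acceptance rate and the false rejection rate are equal. A minimal sketch of how an EER can be approximated from verification scores follows. It assumes similarity scores where higher means "more likely the same user"; the function name and the sample scores are hypothetical, and this is not the authors' evaluation code.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a comparison is accepted when its score >= threshold.
    genuine  -- scores from same-user comparisons
    impostor -- scores from different-user comparisons
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False acceptance: impostor comparisons that pass the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection: genuine comparisons that fail the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With finite score lists the two error rates rarely cross exactly, so the sketch returns the average of the two rates at the threshold where they are closest. If the system produces distances rather than similarities, the comparison directions would need to be flipped.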

End to End ASR System with Automatic Punctuation Insertion

Dec 03, 2020
Yushi Guan

Recent Automatic Speech Recognition systems have been moving towards end-to-end systems that can be trained together. Numerous recently proposed techniques have enabled this trend, including feature extraction with CNNs, context capturing and acoustic feature modeling with RNNs, automatic alignment of input and output sequences using Connectionist Temporal Classification, and the replacement of traditional n-gram language models with RNN Language Models. Historically, there has been a lot of interest in automatic punctuation in textual or speech-to-text contexts. However, there seems to be little interest in incorporating automatic punctuation into the emerging neural network based end-to-end speech recognition systems, partially due to the lack of an English speech corpus with punctuated transcripts. In this study, we propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from We also propose an end-to-end ASR system that outputs words and punctuation concurrently from speech signals. Combining Damerau-Levenshtein distance and slot error rate into DLev-SER, we enable measurement of punctuation error rate when the hypothesis text is not perfectly aligned with the reference. Compared with previous methods, our model reduces slot error rate from 0.497 to 0.341.
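The abstract above names Damerau-Levenshtein distance as one ingredient of DLev-SER but does not define the combined metric. The sketch below illustrates only the edit-distance ingredient, using the restricted variant known as optimal string alignment, applied to token sequences so that punctuation marks count as tokens. The function name and the example tokens are hypothetical, and this is not the paper's DLev-SER.

```python
def osa_distance(ref, hyp):
    """Optimal string alignment distance between two sequences.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent items. This is the restricted Damerau-Levenshtein variant:
    no item may be edited more than once.
    """
    n, m = len(ref), len(hyp)
    # d[i][j] = distance between the first i items of ref and first j of hyp.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (i > 1 and j > 1
                    and ref[i - 1] == hyp[j - 2]
                    and ref[i - 2] == hyp[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]


# Hypothetical token sequences: the comma is displaced by one position.
reference = ["hello", ",", "world", "."]
hypothesis = ["hello", "world", ",", "."]
print(osa_distance(reference, hypothesis))
```

Treating a swapped pair as a single edit is what makes this family of distances attractive when a punctuation mark lands one token away from its reference position. Plain Levenshtein distance would charge two edits for the same displacement.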

Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis

Nov 04, 2020
Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Even though over seven hundred ethnic languages are spoken in Indonesia, the technology available to support communication within indigenous communities, as well as with people outside the villages, remains limited. As a result, indigenous communities still face isolation due to cultural barriers, and languages continue to disappear. To accelerate communication, speech-to-speech translation (S2ST) technology is one approach that can overcome language barriers. However, S2ST systems require machine translation (MT), speech recognition (ASR), and synthesis (TTS) that rely heavily on supervised training and a broad set of language resources that can be difficult to collect from ethnic communities. Recently, a machine speech chain mechanism was proposed to enable ASR and TTS to assist each other in semi-supervised learning. The framework was initially implemented only in a monolingual setting. In this study, we focus on developing speech recognition and synthesis for these Indonesian ethnic languages: Javanese, Sundanese, Balinese, and Bataks. We first separately train ASR and TTS of standard Indonesian in supervised training. We then develop ASR and TTS of ethnic languages by utilizing Indonesian ASR and TTS in a cross-lingual machine speech chain framework with only text or only speech data, removing the need for paired speech-text data of those ethnic languages.

* Accepted in SLTU-CCURL 2020 

DeRPN: Taking a further step toward more general object detection

Nov 16, 2018
Lele Xie, Yuliang Liu, Lianwen Jin, Zecheng Xie

Most current detection methods have adopted anchor boxes as regression references. However, the detection performance is sensitive to the setting of the anchor boxes. A proper setting of anchor boxes may vary significantly across different datasets, which severely limits the universality of the detectors. To improve the adaptivity of the detectors, in this paper, we present a novel dimension-decomposition region proposal network (DeRPN) that can fully replace the traditional Region Proposal Network (RPN). DeRPN utilizes an anchor string mechanism to independently match object widths and heights, which helps it handle varied object shapes. In addition, a novel scale-sensitive loss is designed to address the imbalanced loss computations of different scaled objects, which prevents small objects from being overwhelmed by larger ones. Comprehensive experiments conducted on both general object detection datasets (Pascal VOC 2007, 2012 and MS COCO) and scene text detection datasets (ICDAR 2013 and COCO-Text) all show that our DeRPN can significantly outperform RPN. It is worth mentioning that the proposed DeRPN can be employed directly on different models, tasks, and datasets without any modifications of hyperparameters or specialized optimization, which further demonstrates its adaptivity. The code will be released at

* 8 pages, 4 figures, 6 tables, accepted to appear in AAAI 2019 

CEVO: Comprehensive EVent Ontology Enhancing Cognitive Annotation

Oct 03, 2018
Saeedeh Shekarpour, Faisal Alshargi, Valerie Shalin, Krishnaprasad Thirunarayan, Amit P. Sheth

While the general analysis of named entities has received substantial research attention on unstructured as well as structured data, the analysis of relations among named entities has received limited focus. In fact, a review of the literature revealed a deficiency in research on the abstract conceptualization required to organize relations. We believe that such an abstract conceptualization can benefit various communities and applications such as natural language processing, information extraction, machine learning, and ontology engineering. In this paper, we present Comprehensive EVent Ontology (CEVO), built on Levin's conceptual hierarchy of English verbs, which categorizes verbs with shared meaning and syntactic behavior. We present the fundamental concepts and requirements for this ontology. Furthermore, we present three use cases employing the CEVO ontology on annotation tasks: (i) annotating relations in plain text, (ii) annotating ontological properties, and (iii) linking textual relations to ontological properties. These use cases demonstrate the benefits of using CEVO for annotation: (i) annotating English verbs from an abstract conceptualization, (ii) playing the role of an upper ontology for organizing ontological properties, and (iii) facilitating the annotation of text relations using any underlying vocabulary. This resource is available at using namespace.

ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

May 20, 2016
Hossein Soleimani, David J. Miller

We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and the salient features (words) under each such topic in a synthetic data set and two real-world text corpora, and that it achieves better performance than both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from
