"Text": models, code, and papers

Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence

Jan 14, 2022
Ning Zhang, Mohammadreza Ebrahimi, Weifeng Li, Hsinchun Chen

Automated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA serves as the most prevalent and prohibitive type of these measures in the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. In the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulties in overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has relied heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy background and variable character length. To eliminate the need for human involvement, the proposed framework utilizes a Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving over 94.4% success rate on a carefully collected real-world dark web dataset...

* Accepted by ACM Transactions on Management Information Systems 


Unsupervised Low-Dimensional Vector Representations for Words, Phrases and Text that are Transparent, Scalable, and produce Similarity Metrics that are Complementary to Neural Embeddings

Jan 09, 2018
Neil R. Smalheiser, Gary Bonifield

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = ~0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title+abstracts are all publicly available from for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.

* 27 pages, 9 tables, and 6 supplemental files which can be accessed at Rewrote Introduction. This ms. has been submitted to J. Biomed. Informatics 


Structuring an unordered text document

Jan 29, 2019
Shashank Yadav, Tejas Shimpi, C. Ravindranath Chowdary, Prashant Sharma, Deepansh Agrawal, Shivang Agarwal

Segmenting an unordered text document into different sections is a very useful task in many text processing applications like multiple document summarization, question answering, etc. This paper proposes structuring of an unordered text document based on the keywords in the document. We test our approach on Wikipedia documents using both statistical and predictive methods such as the TextRank algorithm and Google's USE (Universal Sentence Encoder). From our experimental results, we show that the proposed model can effectively structure an unordered document into sections.


A Dynamic Programming Algorithm for the Segmentation of Greek Texts

Oct 21, 2003
Pavlina Fragkou

In this paper we introduce a dynamic programming algorithm to perform linear text segmentation by global minimization of a segmentation cost function which consists of: (a) within-segment word similarity and (b) prior information about segment length. The evaluation of the segmentation accuracy of the algorithm on a text collection consisting of Greek texts showed that the algorithm achieves high segmentation accuracy and appears to be innovative and promising.

* This paper will appear in the Proceedings of the CONSOLE XII Conference (Patras, Greece, 2003) 
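
The abstract above describes segmentation by globally minimizing a cost that combines within-segment similarity with a segment-length prior. A minimal Python sketch of that general idea follows. It is not the author's actual cost function: the Jaccard word overlap, the Gaussian-style length penalty, and the names `jaccard`, `segment_cost`, `segment`, `mu`, `sigma`, and `weight` are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two sets of words (assumed measure)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def segment_cost(sents, i, j, mu, sigma, weight):
    """Hypothetical cost of treating sents[i:j] as one segment.

    Lower is better: similar sentences reduce the cost, and segment
    lengths far from the expected length mu increase it.
    """
    sim = sum(jaccard(sents[a], sents[b]) for a, b in combinations(range(i, j), 2))
    length_penalty = ((j - i) - mu) ** 2 / (2 * sigma ** 2)
    return weight * length_penalty - (1 - weight) * sim


def segment(sents, mu=3.0, sigma=1.0, weight=0.5):
    """Return the end index of each segment in the minimum-cost segmentation."""
    n = len(sents)
    best = [0.0] + [float("inf")] * n  # best[j]: minimal cost of segmenting sents[:j]
    back = [0] * (n + 1)               # back[j]: start of the last segment ending at j
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + segment_cost(sents, i, j, mu, sigma, weight)
            if cost < best[j]:
                best[j], back[j] = cost, i
    bounds, j = [], n
    while j > 0:  # walk the back-pointers from the end of the text
        bounds.append(j)
        j = back[j]
    return bounds[::-1]
```

Each sentence is passed as a set of words. The sketch recomputes the segment cost for every candidate pair of boundaries, which is simple but slow on long texts; a practical version would cache the pairwise similarities.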


On the nature of long-range letter correlations in texts

Aug 31, 2008
Dmitrii Y. Manin

The origin of long-range letter correlations in natural texts is studied using random walk analysis and Jensen-Shannon divergence. It is concluded that they result from slow variations in letter frequency distribution, which are a consequence of slow variations in lexical composition within the text. These correlations are preserved by random letter shuffling within a moving window. As such, they do reflect structural properties of the text, but in a very indirect manner.

* 14 pages, 5 figures, unpublished 
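
The abstract above attributes the correlations to slow variation in the letter frequency distribution, measured with the Jensen-Shannon divergence. As a rough illustration, the following Python sketch computes that divergence between the letter distributions of two text windows. The helper names `letter_distribution` and `js_divergence`, the use of base-2 logarithms, and the choice to count only alphabetic characters are assumptions, not details taken from the paper.

```python
from collections import Counter
from math import log2


def letter_distribution(text):
    """Relative frequency of each alphabetic character in text."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of symbols.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no symbols in common), so comparing successive windows of a text gives a bounded measure of how much the letter frequencies drift.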


Text Classification using Capsules

Aug 14, 2018
Jaeyoung Kim, Sion Jang, Sungchul Choi, Eunjeong Park

This paper presents an empirical exploration of the use of capsule networks for text classification. While it has been shown that capsule networks are effective for image classification, their validity in the domain of text has not been explored. In this paper, we show that capsule networks indeed have the potential for text classification and that they have several advantages over convolutional neural networks. We further suggest a simple routing method that effectively reduces the computational complexity of dynamic routing. We utilized seven benchmark datasets to demonstrate that capsule networks, along with the proposed routing method, provide comparable results.


Multi-Paragraph Segmentation of Expository Text

Jun 23, 1994
Marti A. Hearst

This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.

* To Appear in ACL '94 Proceedings; 8 pages POSTSCRIPT format 
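
The abstract above describes using lexical frequency and distribution information to find subtopic boundaries. A much simplified Python sketch of one such comparison follows: it scores each gap between sentences by the cosine similarity of the word counts on either side, so that a low score suggests a topic shift. This is not the full TextTiling algorithm; the whitespace tokenization, the fixed `block_size`, the absence of smoothing and depth scoring, and the names `cosine`, `gap_scores`, and `lowest_gap` are all assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences, block_size=2):
    """Lexical similarity across each gap between consecutive sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        # Merge up to block_size sentences on each side of the gap.
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores


def lowest_gap(sentences, block_size=2):
    """Index of the sentence that starts after the least cohesive gap, or None."""
    scores = gap_scores(sentences, block_size)
    if not scores:
        return None
    return scores.index(min(scores)) + 1
```

A fuller implementation would smooth the score sequence and choose boundaries from the depth of each valley, not from a single minimum.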


Identifying Populist Paragraphs in Text: A machine-learning approach

Jun 10, 2021
Jogilė Ulinskaitė, Lukas Pukelis

In this paper we present an approach to developing a text-classification model that can identify populist content in text. The developed BERT-based model is largely successful in identifying populist content and produces only a negligible number of false negatives, which makes it well suited as a content-analysis automation tool that shortlists potentially relevant content for human validation.

* 18 pages, 2 Figures, 3 Tables in main text, 2 tables in Annexes 
