"Text": models, code, and papers

Learning Video Representations from Textual Web Supervision

Jul 29, 2020
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid

Videos found on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We fine-tune the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pretraining video representations. Specifically, it leads to improvements over from-scratch training on all benchmarks, outperforms many methods for self-supervised and webly-supervised video representation learning, and achieves an improvement of 2.2% accuracy on HMDB-51.



A Scale-Space Theory for Text

Dec 10, 2012
Shuang-Hong Yang

Scale-space theory has been established primarily by the computer vision and signal processing communities as a well-founded and promising framework for multi-scale processing of signals (e.g., images). By embedding an original signal into a family of gradually coarsened signals parameterized by a continuous scale parameter, it provides a formal framework to capture the structure of a signal at different scales in a consistent way. In this paper, we present a scale-space theory for text by integrating semantic and spatial filters, and demonstrate how natural language documents can be understood, processed and analyzed at multiple resolutions, and how this scale-space representation can be used to facilitate a variety of NLP and text analysis tasks.

* 9 pages, 6 figures; Natural language processing 


Towards Text-based Phishing Detection

Nov 03, 2021
Gilchan Park, Julia M. Taylor

This paper reports on an experiment into text-based phishing detection using readily available resources and without the use of semantics. The developed algorithm is a modified version of previously published work that works with the same tools. The results obtained in recognizing phishing emails are considerably better than the previously reported work, but the rate of text falsely identified as phishing is slightly worse. It is expected that adding a semantic component will reduce the false positive rate while preserving the detection accuracy.

* Society for Design and Process Science (SDPS) 2013, pp.187-192. https://www.sdpsnet.org/sdps/documents/sdps-2013/SDPS_2013_proceedings.pdf 


Towards text-based phishing detection

Nov 02, 2021
Gilchan Park, Julia M. Taylor

This paper reports on an experiment into text-based phishing detection using readily available resources and without the use of semantics. The developed algorithm is a modified version of previously published work that works with the same tools. The results obtained in recognizing phishing emails are considerably better than the previously reported work, but the rate of text falsely identified as phishing is slightly worse. It is expected that adding a semantic component will reduce the false positive rate while preserving the detection accuracy.

* Society for Design and Process Science (SDPS) 2013, pp.187-192 https://www.sdpsnet.org/sdps/documents/sdps-2013/SDPS_2013_proceedings.pdf 


Hamtajoo: A Persian Plagiarism Checker for Academic Manuscripts

Dec 27, 2021
Vahid Zarrabi, Salar Mohtaj, Habibollah Asghari

In recent years, the wide availability of electronic documents on the Web has made plagiarism a serious challenge, especially among scholars. Various plagiarism detection systems have been developed to prevent text re-use and to confront plagiarism. Although it is relatively easy to detect duplicate text in academic manuscripts, finding patterns of text re-use that have been semantically changed is of great importance. Another important issue is dealing with less-resourced languages, for which there is a low volume of text for training purposes and NLP tools perform poorly. In this paper, we introduce Hamtajoo, a Persian plagiarism detection system for academic manuscripts. Moreover, we describe the overall structure of the system along with the algorithms used in each stage. To evaluate the performance of the proposed system, we used a plagiarism detection corpus that complies with the PAN standards.



Detecting Hate Speech with GPT-3

Mar 23, 2021
Ke-Li Chiu, Rohan Alexander

Sophisticated language models such as OpenAI's GPT-3 can generate hateful text that targets marginalized groups. Given this capacity, we are interested in whether large language models can be used to identify hate speech and classify text as sexist or racist. We use GPT-3 to identify sexist and racist text passages with zero-, one-, and few-shot learning. We find that with zero- and one-shot learning, GPT-3 is able to identify sexist or racist text with an accuracy between 48 per cent and 69 per cent. With few-shot learning and an instruction included in the prompt, the model's accuracy can be as high as 78 per cent. We conclude that large language models have a role to play in hate speech detection, and that with further development language models could be used to counter hate speech and even self-police.

* 15 pages, 1 figure, 8 tables 


Semi-Supervised Cleansing of Web Argument Corpora

Nov 03, 2020
Jonas Dorsch, Henning Wachsmuth

Debate portals and similar web platforms constitute one of the main text sources in computational argumentation research and its applications. While the corpora built upon these sources are rich in argumentatively relevant content and structure, they also include text that is irrelevant, or even detrimental, to their purpose. In this paper, we present a precision-oriented approach to detecting such irrelevant text in a semi-supervised way. Given a few seed examples, the approach automatically learns basic lexical patterns of relevance and irrelevance and then incrementally bootstraps new patterns from sentences matching the patterns. In the existing args.me corpus with 400k argumentative texts, our approach detects almost 87k irrelevant sentences, at a precision of 0.97 according to manual evaluation. With low effort, the approach can be adapted to other web argument corpora, providing a generic way to improve corpus quality.

* Accepted at ArgMining 2020 


CompLex --- A New Corpus for Lexical Complexity Prediction from Likert Scale Data

Mar 16, 2020
Matthew Shardlow, Michael Cooper, Marcos Zampieri

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences, each annotated by around 7 annotators.



The Woman Worked as a Babysitter: On Biases in Language Generation

Sep 03, 2019
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng

We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.

* EMNLP 2019 short paper (5 pages) 


Cross-referencing using Fine-grained Topic Modeling

May 18, 2019
Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Emily Hales, Kevin Seppi

Cross-referencing, which links passages of text to other related passages, can be a valuable study aid for facilitating comprehension of a text. However, cross-referencing requires, first, a comprehensive thematic knowledge of the entire corpus and, second, a focused search through the corpus specifically to find such useful connections. As a result, cross-reference resources are prohibitively expensive and exist only for the most well-studied texts (e.g., religious texts). We develop a topic-based system for automatically producing candidate cross-references which can be easily verified by human annotators. Our system utilizes fine-grained topic modeling with thousands of highly nuanced and specific topics to identify verse pairs which are topically related. We demonstrate that our system can be cost-effective compared to having annotators acquire the expertise necessary to produce cross-reference resources unaided.

* 6 figures, 1 table, 8 pages 

