
"Text": models, code, and papers

Machine Learning with Lexical Features: The Duluth Approach to Senseval-2

May 27, 2002
Ted Pedersen

This paper describes the sixteen Duluth entries in the Senseval-2 comparative exercise among word sense disambiguation systems. There were eight pairs of Duluth systems entered in the Spanish and English lexical sample tasks. These are all based on standard machine learning algorithms that induce classifiers from sense-tagged training text, where the context in which ambiguous words occur is represented by simple lexical features. These are highly portable, robust methods that can serve as a foundation for more tailored approaches.

* Appears in the Proceedings of SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems July 5-6, 2001, Toulouse, France 


Stacking classifiers for anti-spam filtering of e-mail

Jun 19, 2001
G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, P. Stamatopoulos

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.

* Proceedings of "Empirical Methods in Natural Language Processing" (EMNLP 2001), L. Lee and D. Harman (Eds.), pp. 44-50, Carnegie Mellon University, Pittsburgh, PA, 2001 
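As a rough illustration of the stacked-generalization idea, the sketch below combines two base scorers through a fixed meta-level linear combination. The base learners, features, and weights here are invented for illustration; they are not the memory-based and probabilistic filters evaluated in the paper, and in real stacking the meta-level weights are learned from held-out base-level predictions.

```python
# Toy sketch of stacked generalization for spam filtering (illustrative only).

def keyword_score(msg):
    # Base classifier 1: fraction of known spammy keywords in the message.
    spam_words = {"free", "winner", "credit", "offer"}
    words = msg.lower().split()
    return sum(w in spam_words for w in words) / max(len(words), 1)

def shouting_score(msg):
    # Base classifier 2: proportion of upper-case letters.
    letters = [c for c in msg if c.isalpha()]
    return sum(c.isupper() for c in letters) / max(len(letters), 1)

def stacked_predict(msg, meta_weights=(4.0, 2.0), bias=-1.0):
    # Meta-level classifier: a linear combination of base-level outputs.
    # In real stacking these weights are learned, not fixed by hand.
    s1, s2 = keyword_score(msg), shouting_score(msg)
    score = meta_weights[0] * s1 + meta_weights[1] * s2 + bias
    return "spam" if score > 0 else "ham"

print(stacked_predict("FREE CREDIT OFFER!!! you are a WINNER"))   # → spam
print(stacked_predict("Meeting moved to 3pm, see agenda attached"))  # → ham
```

The point of the meta-level is that it can learn *when* each base filter is reliable, rather than simply averaging their votes.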


Automatic Extraction of Subcategorization Frames for Czech

Sep 08, 2000
Anoop Sarkar, Daniel Zeman

We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% precision on unseen parsed text.

* Proceedings of the 18th International Conference on Computational Linguistics (Coling 2000) 
* 7 pages. Another version under the name "Learning Verb Subcategorization from Corpora: Counting Frame Subsets", authors: Zeman, Sarkar, in proceedings of LREC 2000, Athens, Greece 


Generic and Trend-aware Curriculum Learning for Relation Extraction in Graph Neural Networks

May 17, 2022
Nidhi Vakil, Hadi Amiri

We present a generic and trend-aware curriculum learning approach for graph neural networks. It extends existing approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and to schedule them for training. The model effectively integrates textual and structural information for relation extraction in text graphs. Experimental results show that the model provides robust estimates of sample difficulty and yields sizable improvements over state-of-the-art approaches across several datasets.

* Long paper accepted at NAACL 2022 
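The loss-trend idea can be sketched in plain Python: fit a least-squares slope to each sample's per-epoch losses and order samples easy-to-hard by how quickly their loss falls. The loss histories and the simple slope-based ordering below are illustrative assumptions, not the paper's actual difficulty scorer or scheduler.

```python
# Minimal sketch of ordering samples by loss trend (made-up numbers).

def loss_trend(losses):
    # Least-squares slope of the per-epoch losses: a steeply decreasing
    # loss suggests the sample is being learned (easier), while a flat
    # or rising loss suggests a harder sample.
    n = len(losses)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical per-sample loss histories over four epochs.
history = {
    "s1": [2.0, 1.2, 0.6, 0.3],   # learned quickly -> easy
    "s2": [1.8, 1.7, 1.6, 1.6],   # barely improves -> hard
    "s3": [2.2, 1.5, 1.0, 0.7],   # in between
}

# Easy-to-hard curriculum: most negative slope (fastest improvement) first.
order = sorted(history, key=lambda s: loss_trend(history[s]))
print(order)  # → ['s1', 's3', 's2']
```

Using the trend rather than a single loss snapshot makes the difficulty estimate less sensitive to noise in any one epoch.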


CREER: A Large-Scale Corpus for Relation Extraction and Entity Recognition

Apr 27, 2022
Yu-Siou Tang, Chung-Hsien Wu

We describe the design and use of the CREER dataset, a large corpus annotated with rich English grammatical and semantic attributes. The CREER dataset uses the Stanford CoreNLP Annotator to capture rich language structures from Wikipedia plain text. Because it follows widely used linguistic and semantic annotation schemes, the dataset can support most natural language processing tasks and can also be extended further. This large supervised dataset can serve as the basis for improving the performance of NLP tasks in the future.


Towards Lithuanian grammatical error correction

Mar 18, 2022
Lukas Stankevičius, Mantas Lukoševičius

Everyone wants to write beautiful and correct text, yet the lack of language skills, experience, or hasty typing can result in errors. By employing recent advances in transformer architectures, we construct a grammatical error correction model for Lithuanian, a language rich in archaic features. We compare subword and byte-level approaches and share our best trained model, achieving F$_{0.5}$=0.92, and accompanying code, in an online open-source repository.
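For reference, the F$_{0.5}$ score reported above weights precision more heavily than recall, which suits error correction, where a false correction is usually worse than a missed one. The sketch below computes F$_{\beta}$ from precision and recall; the precision/recall numbers are illustrative, not the paper's.

```python
# F_beta score: beta < 1 emphasizes precision, beta > 1 emphasizes recall.

def f_beta(precision, recall, beta=0.5):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical system with high precision and somewhat lower recall:
print(round(f_beta(0.95, 0.82), 3))  # → 0.921
```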


LSH methods for data deduplication in a Wikipedia artificial dataset

Dec 10, 2021
Juan Ciro, Daniel Galvez, Tim Schlippe, David Kanter

This paper illustrates locality-sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
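One common LSH variant for text deduplication is MinHash with banding; the sketch below is an assumed, simplified version (the shingle size, signature length, and banding parameters are made up), not necessarily one of the models evaluated in the paper.

```python
# Simplified MinHash + banded LSH for near-duplicate candidate detection.
import hashlib

def shingles(text, k=3):
    # Character k-grams of the lowercased text.
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(sh, num_hashes=32):
    # Signature: for each seed, the minimum hash over all shingles.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in sh)
        for seed in range(num_hashes)
    ]

def lsh_buckets(docs, bands=8, rows=4):
    # Split each signature into bands; docs sharing any band collide,
    # and colliding docs become candidate near-duplicate pairs.
    buckets = {}
    for name, text in docs.items():
        sig = minhash(shingles(text))
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    pairs = set()
    for names in buckets.values():
        for a in names:
            for b in names:
                if a < b:
                    pairs.add((a, b))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "completely unrelated text about wikipedia dumps",
}
print(lsh_buckets(docs))
```

The banding trade-off is the key design choice: more rows per band makes collisions stricter (higher precision), while more bands makes them more permissive (higher recall).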


BCH-NLP at BioCreative VII Track 3: medications detection in tweets using transformer networks and multi-task learning

Nov 26, 2021
Dongfang Xu, Shan Chen, Timothy Miller

In this paper, we present our work participating in the BioCreative VII Track 3 - automatic extraction of medication names in tweets, where we implemented a multi-task learning model that is jointly trained on text classification and sequence labelling. Our best system run achieved a strict F1 of 80.4, ranking first and more than 10 points higher than the average score of all participants. Our analyses show that the ensemble technique, multi-task learning, and data augmentation are all beneficial for medication detection in tweets.

* Proceedings of the seventh BioCreative challenge evaluation workshop 


Interpretive Blindness

Oct 19, 2021
Nicholas Asher, Julie Hunter

We model here an epistemic bias we call \textit{interpretive blindness} (IB). IB is a special problem for learning from testimony, in which one acquires information only from text or conversation. We show that IB follows from a co-dependence between background beliefs and interpretation in a Bayesian setting and from the nature of contemporary testimony. We argue that a particular characteristic of contemporary testimony, \textit{argumentative completeness}, can preclude learning in hierarchical Bayesian settings, even in the presence of constraints that are designed to promote good epistemic practices.


Spanish Language Models

Aug 13, 2021
Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas

This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019. We extended the current evaluation datasets with an extractive Question Answering dataset and our models outperform the existing Spanish models across tasks and settings.
