Walter Daelemans

Tilburg University

Predicting the Effectiveness of Self-Training: Application to Sentiment Classification

Jan 13, 2016

Vincent Van Asch, Walter Daelemans

Abstract: The goal of this paper is to investigate the connection between the performance gain that can be obtained by self-training and the similarity between the corpora used in this approach. Self-training is a semi-supervised technique designed to increase the performance of machine learning algorithms by automatically classifying instances of a task and adding these as additional training material to the same classifier. In the context of language processing tasks, this training material is mostly an (annotated) corpus. Unfortunately, self-training does not always lead to a performance increase, and whether it will is largely unpredictable. We show that the similarity between corpora can be used to identify those setups for which self-training can be beneficial. We consider this research as a step in the process of developing a classifier that is able to adapt itself to each new test corpus that it is presented with.
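
The self-training loop described in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the word-count classifier, the confidence threshold of 0.8, the number of rounds, and the function names `train`, `predict` and `self_train` are all hypothetical.

```python
from collections import Counter

def train(examples):
    """Fit a toy classifier: per-label word counts (hypothetical stand-in model)."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def predict(model, text):
    """Return (label, confidence); confidence is the winning label's share of word hits."""
    scores = {label: sum(c[w] for w in text.split()) for label, c in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known words, so no prediction
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def self_train(labeled, unlabeled, threshold=0.8, rounds=3):
    """Repeatedly label the unlabeled pool and add confident predictions to the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, conf = predict(model, text)
            if label is not None and conf >= threshold:
                confident.append((text, label))
            else:
                rest.append(text)
        if not confident:
            break  # nothing new was added, so further rounds would not change the model
        labeled.extend(confident)
        pool = rest
    return train(labeled), labeled

seed = [("good great", "pos"), ("bad awful", "neg")]
pool = ["great film", "awful plot", "film plot"]
model, augmented = self_train(seed, pool)
```

In this toy run the two unlabeled texts that share a word with the seed data are added with their predicted labels, while the ambiguous one stays unlabeled. Whether such additions help or hurt on a real test corpus is exactly the question the paper addresses.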

The Effects of Age, Gender and Region on Non-standard Linguistic Variation in Online Social Networks

Jan 11, 2016

Claudia Peersman, Walter Daelemans, Reinhild Vandekerckhove, Bram Vandekerckhove, Leona Van Vaerenbergh

Abstract: We present a corpus-based analysis of the effects of age, gender and region of origin on the production of both "netspeak" or "chatspeak" features and regional speech features in Flemish Dutch posts that were collected from a Belgian online social network platform. The present study shows that combining quantitative and qualitative approaches is essential for understanding non-standard linguistic variation in a CMC corpus. It also presents a methodology that enables the systematic study of this variation by including all non-standard words in the corpus. The analyses resulted in a convincing illustration of the Adolescent Peak Principle. In addition, our approach revealed an intriguing correlation between the use of regional speech features and chatspeak features.

Meta-Learning for Phonemic Annotation of Corpora

Aug 18, 2000

Veronique Hoste, Walter Daelemans, Erik Tjong Kim Sang, Steven Gillis

Abstract: We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high-accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to achieve the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at word level) for single classifiers is boosted significantly with additional error reductions of 31% and 38% respectively using combination of classifiers, and a further 5% using combination of meta-learners, bringing overall word level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.

* Proceedings of ICML-2000, Stanford University, CA, USA
* 8 pages

Applying System Combination to Base Noun Phrase Identification

Aug 17, 2000

Erik F. Tjong Kim Sang, Walter Daelemans, Herve Dejean, Rob Koeling, Yuval Krymolowski, Vasin Punyakanok, Dan Roth

Abstract: We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this data set.

* Proceedings of COLING 2000, Saarbruecken, Germany
* 7 pages
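
The combinator named in the abstract above, a majority vote of the top five systems, can be sketched as follows. This is an illustrative sketch only: the function names, the I/O/B tag sequences, the held-out scores and the tie-breaking rule (prefer the tag proposed by the highest-ranked system) are assumptions and do not come from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-system tag sequences token by token.

    predictions: list of equal-length tag sequences, best system first.
    Ties go to the tag proposed by the earliest (best-ranked) system.
    """
    combined = []
    for tags in zip(*predictions):
        counts = Counter(tags)
        best = max(counts.values())
        combined.append(next(t for t in tags if counts[t] == best))
    return combined

def vote_top_k(systems, k=5):
    """systems: list of (held_out_score, tag_sequence) pairs; vote over the k best."""
    ranked = sorted(systems, key=lambda s: s[0], reverse=True)[:k]
    return majority_vote([tags for _, tags in ranked])

# Seven hypothetical systems tagging three tokens (scores and tags are made up).
systems = [
    (0.93, ["I", "I", "O"]),
    (0.92, ["I", "O", "O"]),
    (0.91, ["I", "I", "O"]),
    (0.90, ["O", "I", "B"]),
    (0.89, ["I", "O", "O"]),
    (0.80, ["O", "O", "B"]),
    (0.79, ["O", "O", "B"]),
]
top_five = vote_top_k(systems, k=5)
all_seven = majority_vote([tags for _, tags in systems])
```

In this made-up data the second token is tagged differently depending on whether the two weakest systems are allowed to vote, which shows why restricting the vote to the strongest systems can change the outcome.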

Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

Jul 13, 2000

Jakub Zavrel, Walter Daelemans

Abstract: This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.

* Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), pp. 17--20
* 4 pages

Memory-Based Shallow Parsing

Jun 02, 1999

Walter Daelemans, Sabine Buchholz, Jorn Veenstra

Abstract: We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results: the F-values on the Wall Street Journal (WSJ) treebank are 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection and 79.0% for object detection.

* 8 pages, to appear in: Proceedings of the EACL'99 workshop on Computational Natural Language Learning (CoNLL-99), Bergen, Norway, June 1999
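
The core idea of memory-based learning, storing all training instances and classifying new ones by their nearest stored neighbours, can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the system used in the paper: it uses an unweighted overlap distance, a hypothetical class name `MemoryBasedClassifier`, and invented POS-window features with chunk tags as toy data.

```python
from collections import Counter

class MemoryBasedClassifier:
    """Toy k-nearest-neighbour classifier over symbolic feature tuples."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances):
        # Keep every (features, label) pair; no abstraction or pruning is applied.
        self.memory = list(instances)
        return self

    def predict(self, features):
        def overlap_distance(stored):
            # Number of feature positions on which the two instances disagree.
            return sum(a != b for a, b in zip(stored, features))

        nearest = sorted(self.memory, key=lambda inst: overlap_distance(inst[0]))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data: (previous POS, current POS, next POS) -> chunk tag.
training = [
    (("DT", "NN", "VBZ"), "I-NP"),
    (("NN", "VBZ", "DT"), "B-VP"),
    (("VBZ", "DT", "JJ"), "B-NP"),
    (("DT", "JJ", "NN"), "I-NP"),
]
clf = MemoryBasedClassifier(k=1).fit(training)
noun_tag = clf.predict(("DT", "NN", "VBD"))
verb_tag = clf.predict(("NN", "VBD", "DT"))
```

Neither query appears in memory, yet each is one mismatch away from a stored instance and inherits that instance's tag. Practical memory-based learners usually also weight features by their informativeness, which this sketch leaves out.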

Cascaded Grammatical Relation Assignment

Jun 02, 1999

Sabine Buchholz, Jorn Veenstra, Walter Daelemans

Abstract: In this paper we discuss cascaded Memory-Based grammatical relations assignment. In the first stages of the cascade, we find chunks of several types (NP, VP, ADJP, ADVP, PP) and label them with their adverbial function (e.g. local, temporal). In the last stage, we assign grammatical relations to pairs of chunks. We studied the effect of adding several levels to this cascaded classifier and we found that even the lower-performing chunkers enhanced the performance of the relation finder.

* 8 pages, to appear in: proceedings of EMNLP/VLC-99, University of Maryland, USA, June 21-22, 1999

Forgetting Exceptions is Harmful in Language Learning

Dec 22, 1998

Walter Daelemans, Antal van den Bosch, Jakub Zavrel

Abstract: We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.

* 31 pages, 7 figures, 10 tables. uses 11pt, fullname, a4wide tex styles. Pre-print version of article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning. Figures on page 22 slightly compressed to avoid page overload

Improving Data Driven Wordclass Tagging by System Combination

Jul 31, 1998

Hans van Halteren, Jakub Zavrel, Walter Daelemans

Abstract: In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger.

* Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL'98)
* 7 pages, LaTeX, uses acl.bst, colacl.sty

Modularity in inductively-learned word pronunciation systems

Jan 26, 1998

Antal van den Bosch, Ton Weijters, Walter Daelemans

Abstract: In leading morpho-phonological theories and state-of-the-art text-to-speech systems it is assumed that word pronunciation cannot be learned or performed without in-between analyses at several abstraction levels (e.g., morphological, graphemic, phonemic, syllabic, and stress levels). We challenge this assumption for the case of English word pronunciation. Using IGTree, an inductive-learning decision-tree algorithm, we train and test three word-pronunciation systems in which the number of abstraction levels (implemented as sequenced modules) is reduced from five, via three, to one. The latter system, classifying letter strings directly as mapping to phonemes with stress markers, yields significantly better generalisation accuracies than the two multi-module systems. Analyses of empirical results indicate that positive utility effects of sequencing modules are outweighed by cascading errors passed on between modules.

* Proceedings of NeMLaP3/CoNLL98, 185-194
* 10 pages, uses nemlap3.sty and epsf and ipamacs (WSU IPA) macros
