Eric Laporte

LIGM

A French Corpus Annotated for Multiword Expressions with Adverbial Function

Jun 03, 2026

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

Abstract: This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. The corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit the kinds of MWEs we annotated, describe the resources and methods we used for the annotation, and briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

* Language Resources and Evaluation Conference (LREC), Linguistic Annotation Workshop, 2008, Marrakech, Morocco, pp.48-51

Lexicons and grammars for language processing: industrial or handcrafted products?

Jun 02, 2026

Eric Laporte

Abstract: In recent years, the use of linguistic data for language processing has increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts, such as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) have also been developed recently. Most lexicons and grammars are constructed manually, whereas the construction of corpora has always been highly automated. However, more and more language processing specialists realize that the information content of lexicons and grammars is richer than that of corpora, and hence that the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either language technology specialists progressively get used to handling manually constructed resources, which are more informative and more complex, or the construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the second depends essentially on solutions devised by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or industrial generation, or a combination of both, can give the best results or is the most realistic.

* Léxico e gramática: dos sentidos à construção da significação, Cultura acadêmica, 2009, Trilhas Lingüísticas, 16, pp.51-84

French parsing enhanced with a word clustering method based on a syntactic lexicon

May 30, 2026

Anthony Sigogne, Matthieu Constant, Eric Laporte

Abstract: This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods to verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performance on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).

* Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), 2011, Dublin, Ireland, pp.22-27

Classification of non-analyzable word types in web documents to implement an effective Korean e-learning system

May 28, 2026

Sang-Taek Park, Ae-Lim Ahn, Eric Laporte, Jee-Sun Nam

Abstract: E-learning systems should deliver content that reflects various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that include real-world Korean expressions, such as those in web documents, mobile text messages, or Twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ between the two types. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model to handle such expressions effectively in Korean e-learning systems.

* Doing Research in Applied Linguistics, 2011, pp. 61-68

A new semantically annotated corpus with syntactic-semantic and cross-lingual senses

May 27, 2026

Myriam Rakho, Eric Laporte, Matthieu Constant

Abstract: We describe a new sense-tagged corpus for word sense disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables) and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

* Language Resources and Evaluation (LREC), 2012, Istanbul, Turkey, pp.597-600
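
The three-label scheme in the abstract above can be illustrated with a minimal sketch. It assumes that label (3) is a plain string concatenation of labels (1) and (2); the class name `VerbInstance`, the `#` separator, and the sample values (including the entry identifier `T1`) are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbInstance:
    """One annotated occurrence of a polysemous verb (hypothetical layout)."""
    lemma: str        # the French verb
    translation: str  # label (1): translation observed in the parallel corpus
    lg_entry: str     # label (2): Lexicon-Grammar entry identifier (made up here)

    def fine_grained_label(self, sep: str = "#") -> str:
        # Label (3): concatenation of the translation and the dictionary entry.
        # The separator is an assumption; the abstract does not specify one.
        return f"{self.translation}{sep}{self.lg_entry}"


# Sample values are invented for illustration only.
inst = VerbInstance(lemma="comprendre", translation="understand", lg_entry="T1")
label = inst.fine_grained_label()
```

Keeping labels (1) and (2) as separate fields and deriving (3) on demand means the fine-grained label can never drift out of sync with its two components.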

Formalization of Malagasy conjugation

May 26, 2026

Joro Ny Aina Ranaivoarison, Eric Laporte, Baholisoa Simone Ralalaoherivony

Abstract: This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. The work uses the Unitex platform and comprises the construction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.

* Language and Technology Conference, 2013, Poznań, Poland, pp.457-462

Pattern-and-root inflectional morphology: the Arabic broken plural

May 21, 2026

Alexis Amid Neme, Eric Laporte

Abstract: We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3 200 entries of BP nouns.

* Language Sciences, 2013, 40, pp.221-250
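
The idea of recording vowel quantity as v or vv while ignoring vowel quality can be illustrated with a toy sketch. It is not the paper's encoding scheme. It assumes a simplified Latin transliteration in which a, i, u are short vowels and a doubled vowel letter marks a long vowel; the function name `skeleton` and the symbol `C` for consonant slots are hypothetical.

```python
# Assumption: toy Latin transliteration with short vowels a, i, u.
VOWELS = set("aiu")


def skeleton(word: str) -> str:
    """Reduce a transliterated word to consonant slots and vowel quantity."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            # A doubled vowel letter counts as one long vowel (vv);
            # which vowel it is (quality) is deliberately discarded.
            if i + 1 < len(word) and word[i + 1] == ch:
                out.append("vv")
                i += 2
            else:
                out.append("v")
                i += 1
        else:
            out.append("C")  # hypothetical symbol for any consonant slot
            i += 1
    return "".join(out)


singular = skeleton("kitaab")  # transliterated singular, 'book'
plural = skeleton("kutub")     # its broken plural, 'books'
```

Under these assumptions, words that differ only in vowel quality fall under the same skeleton, which is how ignoring quality shrinks the number of distinct singular patterns.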

Conversion of Lexicon-Grammar tables to LMF. Application to French

May 14, 2026

Eric Laporte, Elsa Tolone, Mathieu Constant

Abstract: We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.

* LMF. Lexical Markup Framework, 2013, ISTE - Wiley, pp.157-187

Choosing features for classifying multiword expressions

May 12, 2026

Eric Laporte

Abstract: Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I draw on previous work that takes various languages into account.

* Multiword expressions: Insights from a multi-lingual perspective, 2018, Language Science Press, pp.143-186

Concordance Comparison as a Means of Assembling Local Grammars

May 12, 2026

Juliana Pirovani, Elias de Oliveira, Eric Laporte

Abstract: Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LGs) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points over the state of the art for Portuguese.

* Computational Processing of the Portuguese Language. 13th International Conference, PROPOR, Canela, Brazil, September 24-26, 2018, Proceedings, 11122, Springer, pp.57-65, Lecture Notes in Artificial Intelligence
