
Carlos Gómez-Rodríguez


Assessment of Pre-Trained Models Across Languages and Grammars

Sep 20, 2023
Alberto Muñoz-Ortiz, David Vilares, Carlos Gómez-Rodríguez

We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
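To illustrate the casting of parsing as sequence labeling, a dependency tree can be encoded as one label per token. The sketch below uses a relative head-position encoding; the label format and function names are illustrative, not the paper's exact scheme.

```python
def encode_rel_pos(heads, rels):
    """Encode a dependency tree as per-token labels.
    heads[i] is the 1-based head of token i+1 (0 = artificial root)."""
    labels = []
    for i, (h, r) in enumerate(zip(heads, rels), start=1):
        offset = h - i  # relative position of the head w.r.t. the token
        labels.append(f"{offset:+d}@{r}")
    return labels

def decode_rel_pos(labels):
    """Recover heads and relations from the per-token labels."""
    heads, rels = [], []
    for i, lab in enumerate(labels, start=1):
        off, rel = lab.split("@")
        heads.append(i + int(off))
        rels.append(rel)
    return heads, rels
```

With this encoding, any sequence-labeling model (here, one built on a pre-trained LLM) can predict syntactic structure token by token, and the tree is recovered by decoding the labels.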

* Accepted at IJCNLP-AACL 2023 

Revisiting Supertagging for HPSG

Sep 14, 2023
Olga Zamaraeva, Carlos Gómez-Rodríguez

We present new supertaggers trained on HPSG-based treebanks. These treebanks feature high-quality annotation based on a well-developed linguistic theory and include diverse and challenging test datasets, beyond the usual WSJ section 23 and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based methods and show that both SVM and neural supertaggers achieve considerably higher accuracy than the baseline. Our fine-tuned BERT-based tagger achieves 97.26% accuracy on 1000 sentences from WSJ23 and 93.88% on the completely out-of-domain The Cathedral and the Bazaar (cb). We conclude that it makes sense to integrate these new supertaggers into modern HPSG parsers, and we hope that the diverse and difficult datasets used here will gain more popularity in the field. We contribute the complete dataset, reformatted for token classification.

* 9 pages, 0 figures 

Experimenting with UD Adaptation of an Unsupervised Rule-based Approach for Sentiment Analysis of Mexican Tourist Texts

Sep 11, 2023
Olga Kellert, Mahmud Uz Zaman, Nicholas Hill Matlis, Carlos Gómez-Rodríguez

This paper summarizes the results of experimenting with Universal Dependencies (UD) adaptation of an Unsupervised, Compositional and Recursive (UCR) rule-based approach for Sentiment Analysis (SA), submitted to the Shared Task at Rest-Mex 2023 (Team Olga/LyS-SALSA) within the IberLEF 2023 conference. By applying basic syntactic rules, such as rules of modification and negation, to words from sentiment dictionaries, our approach exploits several advantages of an unsupervised method for SA: (1) interpretability and explainability, (2) robustness across datasets, languages and domains, and (3) usability by non-experts in NLP. We compare our approach with other unsupervised approaches to SA that, in contrast to our UCR rule-based approach, use simple heuristic rules to deal with negation and modification. Our results show a considerable improvement over these approaches. We discuss future improvements of our results by using modality features as another polarity-shifting rule and word disambiguation techniques to identify the right sentiment words.
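A minimal sketch of how such negation and modification rules can compose over dictionary polarities. The lexicon, rule sets, and weights below are invented for illustration and are not the system's actual resources.

```python
# Illustrative sentiment lexicon and rule sets (not the paper's dictionaries).
SENT = {"good": 1.0, "bad": -1.0}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}

def polarity(word, modifiers):
    """Compose the polarity of a sentiment word with its modifiers,
    applying negation (flip) and modification (scale) rules in order."""
    score = SENT.get(word, 0.0)
    for m in modifiers:
        if m in NEGATORS:
            score = -score              # negation rule: flip polarity
        elif m in INTENSIFIERS:
            score *= INTENSIFIERS[m]    # modification rule: scale polarity
    return score
```

In the full system these rules apply recursively over the UD syntactic structure, which is what makes the predictions interpretable: every polarity shift can be traced back to a rule firing on a specific dependency.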

* Proceedings of IberLEF 2023, Jaén, Spain, 2023 

Contrasting Linguistic Patterns in Human and LLM-Generated Text

Aug 17, 2023
Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares

We conduct a quantitative analysis contrasting human-written English news text with comparable output from four large language models (LLMs) of the LLaMA family. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Among other findings, human texts exhibit more scattered sentence length distributions, a distinct use of dependency and constituent types, shorter constituents, and more aggressive emotions (fear, disgust) than LLM-generated texts. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs.


Parsing linearizations appreciate PoS tags - but some are fussy about errors

Oct 27, 2022
Alberto Muñoz-Ortiz, Mark Anderson, David Vilares, Carlos Gómez-Rodríguez

PoS tags, once taken for granted as a useful resource for syntactic parsing, have become more situational with the popularization of deep learning. Recent work on the impact of PoS tags on graph- and transition-based parsers suggests that they are only useful when tagging accuracy is prohibitively high, or in low-resource scenarios. However, such an analysis is lacking for the emerging sequence labeling parsing paradigm, where it is especially relevant as some models explicitly use PoS tags for encoding and decoding. We undertake a study and uncover some trends. Among them, PoS tags are generally more useful for sequence labeling parsers than for other paradigms, but the impact of their accuracy is highly encoding-dependent, with the PoS-based head-selection encoding being best only when both tagging accuracy and resource availability are high.

* Accepted at AACL 2022 

The Impact of Edge Displacement Vaserstein Distance on UD Parsing Performance

Sep 15, 2022
Mark Anderson, Carlos Gómez-Rodríguez

We contribute to the discussion on parsing performance in NLP by introducing a measurement that evaluates the differences between the distributions of edge displacement (the directed distance of edges) seen in training and test data. We hypothesize that this measurement will be related to differences observed in parsing performance across treebanks. We motivate this by building upon previous work and then attempt to falsify this hypothesis by using a number of statistical methods. We establish that there is a statistical correlation between this measurement and parsing performance even when controlling for potential covariants. We then use this to establish a sampling technique that gives us an adversarial and complementary split. This gives an idea of the lower and upper bounds of parsing systems for a given treebank in lieu of freshly sampled data. In a broader sense, the methodology presented here can act as a reference for future correlation-based exploratory work in NLP.
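A rough sketch of the measurement (our own reconstruction, not the authors' code): collect the signed head-dependent distances from each split, then compare the two empirical distributions with the 1-Wasserstein distance.

```python
def edge_displacements(sentences):
    """sentences: lists of 1-based head indices per token (0 = root).
    Edge displacement = head position minus dependent position."""
    disps = []
    for heads in sentences:
        for dep, head in enumerate(heads, start=1):
            if head != 0:  # skip the artificial root edge
                disps.append(head - dep)
    return disps

def wasserstein1(a, b):
    """1-Wasserstein distance between two empirical samples,
    computed as the area between their CDFs."""
    xs = sorted(set(a) | set(b))
    cdf = lambda s, x: sum(v <= x for v in s) / len(s)
    return sum(abs(cdf(a, x0) - cdf(b, x0)) * (x1 - x0)
               for x0, x1 in zip(xs, xs[1:]))

train = edge_displacements([[2, 0, 2], [0, 1]])
test = edge_displacements([[3, 3, 0]])
divergence = wasserstein1(train, test)
```

The larger this divergence between training and test displacement distributions, the worse a parser can be expected to transfer, which is the correlation the paper tests for.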

* Computational Linguistics, 48(3):517-554, 2022  
* This is the final peer-reviewed manuscript accepted for publication in Computational Linguistics. The journal version with the final editorial and typesetting changes is available open-access at https://doi.org/10.1162/coli_a_00440 

The Fragility of Multi-Treebank Parsing Evaluation

Sep 14, 2022
Iago Alonso-Alonso, David Vilares, Carlos Gómez-Rodríguez

Treebank selection for parsing evaluation and the spurious effects that might arise from a biased choice have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typology or data scarcity. Second, we run a large-scale version of this experiment, creating a vast number of random treebank subsets and comparing on them many parsers whose scores are available. The results show substantial variability across subsets and that, although establishing guidelines for good treebank selection is hard, it is possible to detect potentially harmful strategies.
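The large-scale protocol can be caricatured in a few lines: sample many random treebank subsets and observe how each parser's average score (and hence the ranking) shifts across subsets. Parser names and numbers below are placeholders, not scores from the paper.

```python
import random
import statistics

scores = {  # parser -> per-treebank score (illustrative numbers only)
    "parserA": {"tb1": 85.0, "tb2": 70.0, "tb3": 92.0, "tb4": 60.0},
    "parserB": {"tb1": 83.0, "tb2": 78.0, "tb3": 88.0, "tb4": 71.0},
}

def subset_means(scores, k, n_subsets, seed=0):
    """For n_subsets random k-treebank subsets, record each parser's
    mean score on that subset."""
    rng = random.Random(seed)
    treebanks = sorted(next(iter(scores.values())))
    means = {p: [] for p in scores}
    for _ in range(n_subsets):
        sub = rng.sample(treebanks, k)
        for p, s in scores.items():
            means[p].append(statistics.mean(s[t] for t in sub))
    return means
```

The spread of each parser's subset means (and how often the two parsers swap places) is the variability the paper quantifies at scale.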

* Accepted at COLING 2022 

Cross-lingual Inflection as a Data Augmentation Method for Parsing

May 20, 2022
Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, David Vilares

We propose a morphology-based method for low-resource (LR) dependency parsing. We train a morphological inflector for target LR languages, and apply it to related rich-resource (RR) treebanks to create cross-lingual (x-inflected) treebanks that resemble the target LR language. We use such inflected treebanks to train parsers in zero- (training on x-inflected treebanks) and few-shot (training on x-inflected and target language treebanks) setups. The results show that the method sometimes improves the baselines, but not consistently.

* 10 pages, 7 tables, 5 figures. Workshop on Insights from Negative Results in NLP 2022 (co-located with ACL) 

A machine transliteration tool between Uzbek alphabets

May 19, 2022
Ulugbek Salaev, Elmurod Kuriyozov, Carlos Gómez-Rodríguez

Machine transliteration, as defined in this paper, is the process of automatically transforming the written script of words from a source alphabet into another target alphabet within the same language, while preserving their meaning and pronunciation. The main goal of this paper is to present a machine transliteration tool between the three common scripts used in the low-resource Uzbek language: the old Cyrillic, the currently official Latin, and the newly announced New Latin alphabets. The tool was created using a combination of rule-based and fine-tuning approaches. It is available as an open-source Python package, as well as a web-based application including a public API. To our knowledge, this is the first machine transliteration tool that supports the newly announced Latin alphabet of the Uzbek language.
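The rule-based component can be sketched as a character-level mapping with case handling. The table below is a small illustrative subset of Cyrillic-to-Latin correspondences, not the package's actual rule set or API.

```python
# Illustrative subset of Uzbek Cyrillic -> Latin rules (lowercase forms).
CYR2LAT = {
    "ш": "sh", "ч": "ch", "ў": "oʻ", "ғ": "gʻ",
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "с": "s", "т": "t",
}

def translit(text):
    """Map each Cyrillic character to its Latin form, preserving case;
    characters without a rule pass through unchanged."""
    out = []
    for ch in text:
        rep = CYR2LAT.get(ch.lower(), ch)
        out.append(rep.capitalize() if ch.isupper() else rep)
    return "".join(out)
```

One-to-many mappings like "ш" → "sh" are why the reverse (Latin-to-Cyrillic) direction needs longest-match rules or, as in the paper, a learned component on top of the rules.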

* Preprint of a conference paper: The International Conference on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP) 