David Samuel

NoCoLA: The Norwegian Corpus of Linguistic Acceptability

Jun 13, 2023
Matias Jentoft, David Samuel

While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and unacceptable sentences. NoCoLA_zero, in contrast, is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.
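
For the zero-shot task, a common recipe is to score each sentence of a minimal pair with a masked language model's pseudo-log-likelihood and check whether the acceptable variant scores higher. A minimal sketch, assuming a HuggingFace masked LM; the multilingual checkpoint and the Norwegian example pair below are our own placeholders, not necessarily what the paper evaluates:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder model; the paper compares dedicated Norwegian LMs instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum the log-probabilities
    the model assigns to the original tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

good = pseudo_log_likelihood("Hun leser en bok.")   # "She reads a book."
bad = pseudo_log_likelihood("Hun leser en bøker.")  # ungrammatical variant
print(good > bad)  # ideally True for a grammatically competent model
```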

* Published at NoDaLiDa 2023 

Tokenization with Factorized Subword Encoding

Jun 13, 2023
David Samuel, Lilja Øvrelid

In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords into discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
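
As a rough illustration of the factorized-index idea, the sketch below decomposes a subword id into three base-256 digits and sums three small codebook embeddings; the real Factorizer instead learns the triplets with a character-level VQ-VAE, so the decomposition here is only a stand-in for that learned mapping:

```python
import torch
import torch.nn as nn

CODEBOOK = 256  # three codebooks of 256 codes address 256**3 subwords

def to_triplet(subword_id: int) -> tuple[int, int, int]:
    # Stand-in for the learned VQ-VAE mapping: plain base-256 digits.
    a, rest = divmod(subword_id, CODEBOOK * CODEBOOK)
    b, c = divmod(rest, CODEBOOK)
    return a, b, c

class FactorizedEmbedding(nn.Module):
    """Embeds a subword as the sum of its three code embeddings,
    replacing one huge vocabulary table with three small ones."""
    def __init__(self, dim: int):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(CODEBOOK, dim) for _ in range(3))

    def forward(self, triplets: torch.Tensor) -> torch.Tensor:
        # triplets: integer tensor of shape (..., 3)
        return sum(table(triplets[..., i]) for i, table in enumerate(self.tables))

emb = FactorizedEmbedding(dim=32)
print(emb(torch.tensor([to_triplet(70_000)])).shape)  # torch.Size([1, 32])
```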

* Findings of ACL 2023 

NorBench -- A Benchmark for Norwegian Language Models

May 06, 2023
David Samuel, Andrey Kutuzov, Samia Touileb, Erik Velldal, Lilja Øvrelid, Egil Rønningstad, Elina Sigdel, Anna Palatkina

We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

* Accepted to NoDaLiDa 2023 

BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer

Apr 19, 2023
Lucas Georges Gabriel Charpentier, Sondre Wold, David Samuel, Egil Rønningstad

Retrieval-based language models are increasingly employed in question-answering tasks. These models search a corpus of documents for relevant information instead of storing all factual knowledge in their parameters, thereby enhancing efficiency, transparency, and adaptability. We develop the first Norwegian retrieval-based model by adapting the REALM framework and evaluate it on various tasks. After training, we also separate the language model, which we call the reader, from the retriever components, and show that it can be fine-tuned on a range of downstream tasks. Results show that retrieval-augmented language modeling improves the reader's performance on extractive question answering, suggesting that this type of training improves language models' general ability to use context, and that this does not come at the expense of other abilities such as part-of-speech tagging, dependency parsing, named entity recognition, and lemmatization. Code, trained models, and data are made publicly available.
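
A hedged sketch of the retrieve-then-read pipeline the REALM framework implies, with a plain dot-product retriever and a generic extractive QA pipeline standing in for the actual BRENT components:

```python
import torch
from transformers import pipeline

def retrieve(question_vec, passage_vecs, passages, k=1):
    # Dense retrieval: dot-product relevance between a question
    # embedding and precomputed passage embeddings.
    scores = passage_vecs @ question_vec
    return [passages[int(i)] for i in scores.topk(k).indices]

# Generic stand-in; in BRENT the reader is the retrieval-trained
# language model, fine-tuned for extractive question answering.
reader = pipeline("question-answering")

def answer(question, question_vec, passage_vecs, passages):
    context = " ".join(retrieve(question_vec, passage_vecs, passages))
    return reader(question=question, context=context)["answer"]
```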

* Accepted for NoDaLiDa 2023, main conference 

Trained on 100 million words and still in shape: BERT meets British National Corpus

Mar 29, 2023
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
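
For context, the baseline objective such a comparative study includes is standard masked language modeling; a minimal sketch of the usual BERT masking recipe (15% of positions: 80% become [MASK], 10% a random token, 10% unchanged), not the paper's exact implementation:

```python
import random

def mask_tokens(ids, mask_id, vocab_size, p=0.15):
    """Return corrupted input ids and MLM labels (-100 = not predicted)."""
    out, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if random.random() < p:
            labels[i] = tok           # the model must recover this token
            r = random.random()
            if r < 0.8:
                out[i] = mask_id      # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return out, labels
```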

* Accepted to EACL 2023 

EventGraph at CASE 2021 Task 1: A General Graph-based Approach to Protest Event Extraction

Oct 18, 2022
Huiling You, David Samuel, Samia Touileb, Lilja Øvrelid

This paper presents our submission to the 2022 edition of the CASE 2021 shared task 1, subtask 4. The EventGraph system adapts an end-to-end, graph-based semantic parser to the task of Protest Event Extraction, and more specifically to subtask 4 on event trigger and argument extraction. We experiment with various graphs, encoding the events as either "labeled-edge" or "node-centric" graphs. We show that the "node-centric" approach yields the best results overall, performing well across the three languages of the task, namely English, Spanish, and Portuguese. EventGraph is ranked 3rd for English and Portuguese, and 4th for Spanish. Our code is available at: https://github.com/huiling-y/eventgraph_at_case
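
To make the two encodings concrete, here is a toy protest event represented both ways; the field names are our own illustration, not the system's internal format:

```python
# "labeled-edge": role labels sit on the edges from the trigger.
labeled_edge = {
    "nodes": ["marched", "protesters", "Oslo"],
    "edges": [("marched", "protesters", "participant"),
              ("marched", "Oslo", "place")],
}

# "node-centric": role labels sit on the argument nodes themselves,
# leaving the edges unlabeled.
node_centric = {
    "nodes": [("marched", "trigger"), ("protesters", "participant"),
              ("Oslo", "place")],
    "edges": [("marched", "protesters"), ("marched", "Oslo")],
}
```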

EventGraph: Event Extraction as Semantic Graph Parsing

Oct 16, 2022
Huiling You, David Samuel, Samia Touileb, Lilja Øvrelid

Event extraction involves the detection and extraction of both event triggers and their corresponding arguments. Existing systems often decompose event extraction into multiple subtasks without considering the possible interactions between them. In this paper, we propose EventGraph, a joint framework for event extraction that encodes events as graphs. We represent event triggers and arguments as nodes in a semantic graph. Event extraction therefore becomes a graph parsing problem, which brings the following advantages: 1) performing event detection and argument extraction jointly; 2) detecting and extracting multiple events from a piece of text; and 3) capturing the complicated interactions between event arguments and triggers. Experimental results on ACE2005 show that our model is competitive with state-of-the-art systems and substantially improves results on argument extraction. Additionally, we create two new datasets from ACE2005 where we keep the entire text spans for event arguments, instead of just the head word(s). Our code and models are released as open-source.
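
To illustrate the head-word versus full-span distinction behind the two new datasets, a toy example of our own (not drawn from ACE2005), with character offsets into the sentence:

```python
sentence = "The president of the company resigned yesterday."

# Head-word annotation keeps only the argument's syntactic head.
head_only = {"trigger": "resigned",
             "argument": ("president", (4, 13))}

# Full-span annotation keeps the entire argument phrase.
full_span = {"trigger": "resigned",
             "argument": ("The president of the company", (0, 28))}

assert sentence[4:13] == "president"
assert sentence[0:28] == "The president of the company"
```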

* Accepted by CASE@EMNLP 2022 

Direct parsing to sentiment graphs

Mar 24, 2022
David Samuel, Jeremy Barnes, Robin Kurtz, Stephan Oepen, Lilja Øvrelid, Erik Velldal

This paper demonstrates how a graph-based semantic parser can be applied to the task of structured sentiment analysis, directly predicting sentiment graphs from text. We advance the state of the art on 4 out of 5 standard benchmark sets. We release the source code, models and predictions.
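
For illustration, the target structure in structured sentiment analysis is a graph over holder, expression, and target spans with a polarity; the encoding below is our own toy rendering, not the model's output format:

```python
# "I love the new camera" as a single sentiment graph.
sentiment_graph = {
    "holder": "I",
    "expression": "love",
    "target": "the new camera",
    "polarity": "positive",
}
```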

* Accepted to ACL 2022 

ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5

Nov 17, 2021
David Samuel, Milan Straka

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
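
A minimal sketch of byte-level seq2seq normalization with the public ByT5 base checkpoint; note that the untuned base model will not actually normalize anything useful, and the fine-tuned checkpoints live under https://huggingface.co/ufal (exact model identifiers not reproduced here):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Public base checkpoint; substitute a fine-tuned normalization model.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

noisy = "new pix comming tomoroe"
inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```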

* Accepted to W-NUT 2021 