Nikolaos Aletras

Self-calibration for Language Model Quantization and Pruning

Oct 22, 2024

Miles Williams, George Chrysostomou, Nikolaos Aletras

Abstract: Quantization and pruning are fundamental approaches for model compression, enabling efficient inference for language models. In a post-training setting, state-of-the-art quantization and pruning methods require calibration data, a small set of unlabeled examples. Conventionally, randomly sampled web text is used, aiming to reflect the model training data. However, this poses two key problems: (1) unrepresentative calibration examples can harm model performance, and (2) organizations increasingly avoid releasing model training data. In this paper, we propose self-calibration as a solution. Our approach requires no external data, instead leveraging the model itself to generate synthetic calibration data as a better approximation of the pre-training data distribution. We extensively compare the performance of self-calibration with several baselines, across a variety of models, compression methods, and tasks. Our approach proves consistently competitive in maximizing downstream task performance, frequently outperforming even using real data.

* Work in progress

Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research

Oct 04, 2024

Yida Mu, Mali Jin, Xingyi Song, Nikolaos Aletras

Abstract: Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of the data quality of 20 datasets extensively used in NLP for CSS. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for improving dataset development from social media data and its usage.

* Accepted at EMNLP 2024 Main
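
The duplication problems named in the abstract above (label inconsistencies and train/test leakage) can be illustrated with a minimal sketch. It assumes exact matching after lowercasing and whitespace normalisation, which is a simplification and not necessarily the paper's protocol; the function names and the toy examples are hypothetical.

```python
from collections import defaultdict

def normalize(text):
    # Assumption: lowercase and collapse whitespace so trivial variants match.
    return " ".join(text.lower().split())

def find_label_conflicts(examples):
    """Return normalized texts that appear with more than one label."""
    labels = defaultdict(set)
    for text, label in examples:
        labels[normalize(text)].add(label)
    return {t: sorted(ls) for t, ls in labels.items() if len(ls) > 1}

def find_leakage(train, test):
    """Return normalized texts that occur in both the train and test splits."""
    train_texts = {normalize(t) for t, _ in train}
    return sorted({normalize(t) for t, _ in test} & train_texts)

def deduplicate(examples):
    """Keep only the first occurrence of each normalized text."""
    seen, kept = set(), []
    for text, label in examples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept

# Hypothetical toy data, for illustration only.
train = [("Great day!", "pos"), ("great  day!", "neg"), ("So tired", "neg")]
test = [("So tired", "neg"), ("New post", "pos")]

print(find_label_conflicts(train))  # duplicated text carrying two labels
print(find_leakage(train, test))    # text shared by train and test
print(deduplicate(train))           # first occurrences only
```

Near-duplicates such as retweets with added handles would need fuzzier matching than this sketch provides.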

Vocabulary Expansion for Low-resource Cross-lingual Transfer

Jun 17, 2024

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

Abstract: Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e. languages and compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size and initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming a popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and model.

Who is bragging more online? A large scale analysis of bragging in social media

Mar 25, 2024

Mali Jin, Daniel Preoţiuc-Pietro, A. Seza Doğruöz, Nikolaos Aletras

Abstract: Bragging is the act of uttering statements that are likely to be positively viewed by others, and it is extensively employed in human communication with the aim of building a positive self-image. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging online and its characteristics. This paper employs computational sociolinguistics methods to conduct the first large scale study of bragging behavior on Twitter (U.S.) by focusing on its overall prevalence, temporal dynamics and impact of demographic factors. Our study shows that the prevalence of bragging decreases over time within the same population of users. In addition, younger, more educated and popular users in the U.S. are more likely to brag. Finally, we conduct an extensive linguistic analysis to unveil specific bragging themes associated with different user traits.

* Accepted at LREC-COLING 2024

Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models

Mar 19, 2024

Zhixue Zhao, Nikolaos Aletras

Abstract: In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distributions given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its monolingual counterparts. Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available: https://github.com/casszhao/multilingual-faith.

* Accepted at NAACL 2024 Main Conference

An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference

Feb 16, 2024

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

Abstract: The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of various cross-lingual vocabulary adaptation methods on five generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that cross-lingual vocabulary adaptation substantially contributes to LLM inference speedups of up to 271.5%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.

We Need to Talk About Classification Evaluation Metrics in NLP

Jan 08, 2024

Peter Vickers, Loïc Barrault, Emilio Monti, Nikolaos Aletras

Abstract: In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics, and the arbitrariness of their application, suggest that there is no agreement within NLP on a single best metric to use. This lack of agreement suggests there has not been sufficient examination of the underlying heuristics which each metric encodes. To address this we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the scikit-learn classifier format.

* Appeared in AACL 2023
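
To make the contrast between Accuracy and Informedness concrete, here is a minimal sketch of the binary form of Informedness (Youden's J), computed as true positive rate plus true negative rate minus one. It is not the implementation released with the paper, and the multiclass formulation used there is not reproduced; the function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def informedness(y_true, y_pred, positive=1):
    """Binary Informedness (Youden's J): TPR + TNR - 1."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return tpr + tnr - 1

# Hypothetical imbalanced test set: nine negatives and one positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10

print(accuracy(y_true, always_negative))      # 0.9
print(informedness(y_true, always_negative))  # 0.0
```

A classifier that always predicts the majority class scores 0.9 on Accuracy here but 0.0 on Informedness, which is the chance-level behaviour a random-guess normalised metric is meant to expose.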

How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?

Nov 16, 2023

Miles Williams, Nikolaos Aletras

Abstract: Pruning and quantization form the foundation of model compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated state-of-the-art performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples, to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of pruning and quantization methods, tasks, models, and datasets. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
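
Why calibration data can matter is easiest to see in a deliberately simplified setting. The sketch below uses symmetric absmax quantization, where the quantization scale is estimated from calibration activations. This is a stand-in chosen for illustration and is not one of the methods evaluated in the paper; all names and values are hypothetical.

```python
def absmax_scale(calibration_values, bits=8):
    """Estimate a symmetric quantization scale from calibration activations."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    peak = max(abs(v) for v in calibration_values)
    if peak == 0:
        raise ValueError("calibration values must not all be zero")
    return peak / qmax

def quantize(values, scale, bits=8):
    """Round to the nearest integer level and clip to the representable range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def mean_abs_error(values, scale):
    """Average reconstruction error after quantizing and dequantizing."""
    restored = [q * scale for q in quantize(values, scale)]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Hypothetical activations seen at inference time.
deployment = [0.5, -1.2, 3.0, -2.7, 0.1]

# Two hypothetical calibration sets: one covers the real range, one does not.
representative = [2.9, -3.0, 0.4]
narrow = [0.2, -0.3, 0.1]

err_representative = mean_abs_error(deployment, absmax_scale(representative))
err_narrow = mean_abs_error(deployment, absmax_scale(narrow))
print(err_representative, err_narrow)
```

With the narrow calibration set, most deployment activations are clipped, so the reconstruction error is far larger than with the representative set.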

Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization

Nov 15, 2023

George Chrysostomou, Zhixue Zhao, Miles Williams, Nikolaos Aletras

Abstract: Despite their remarkable performance on abstractive summarization, large language models (LLMs) face two significant challenges: their considerable size and tendency to hallucinate. Hallucinations are concerning because they erode the reliability of LLMs and raise safety issues. Pruning is a technique that reduces model size by removing redundant weights to create sparse models that enable more efficient inference. Pruned models yield comparable performance to their counterpart full-sized models, making them ideal alternatives when operating on a limited budget. However, the effect that pruning has upon hallucinations in abstractive summarization with LLMs has yet to be explored. In this paper, we provide an extensive empirical study on the hallucinations produced by pruned models across three standard summarization tasks, two pruning approaches, three instruction-tuned LLMs, and three hallucination evaluation metrics. Surprisingly, we find that pruned LLMs hallucinate less compared to their full-sized counterparts. Our follow-up analysis suggests that pruned models tend to depend more on the source input and less on their parametric knowledge from pre-training for generation. This greater dependency on the source input leads to a higher lexical overlap between generated content and the source input, which can be a reason for the reduction in hallucinations.
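
The idea of removing weights to obtain a sparse model can be sketched with unstructured magnitude pruning, which zeroes the weights with the smallest absolute values. The abstract does not name the two pruning approaches studied, so this is an illustrative assumption and not the paper's method; the function name and the toy weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical weights from a single layer.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2]
sparse = magnitude_prune(weights, sparsity=0.5)
print(sparse)  # [0.8, 0.0, 0.3, -0.9, 0.0, 0.0]
```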

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Oct 26, 2023

Ahmed Alajrami, Katerina Margatina, Nikolaos Aletras

Abstract: Understanding how and what pre-trained language models (PLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that even when pre-training under extreme settings, i.e. using only one character of each token, performance retention on standard NLU benchmarks and probing tasks is high compared to full-token models. For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately 90% and 77% of the full-token model on SuperGLUE and GLUE tasks, respectively.

* To appear at EMNLP 2023
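
The input reduction described in the abstract above can be sketched as follows. The sketch assumes whitespace tokenization and treats performance retention as the reduced model's score expressed as a percentage of the full-token model's score; both are assumptions rather than details confirmed by the abstract, and the scores in the example are hypothetical.

```python
def keep_first_chars(text, k=1):
    """Replace each whitespace-separated token with its first k characters."""
    return [token[:k] for token in text.split()]

def performance_retention(reduced_score, full_score):
    """Reduced model's score as a percentage of the full-token model's score."""
    if full_score == 0:
        raise ValueError("full_score must be non-zero")
    return 100.0 * reduced_score / full_score

print(keep_first_chars("language models learn"))       # ['l', 'm', 'l']
print(keep_first_chars("language models learn", k=2))  # ['la', 'mo', 'le']

# Hypothetical benchmark scores, for illustration only.
print(performance_retention(reduced_score=45.0, full_score=50.0))  # 90.0
```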
