Abstract:As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.
Abstract:The Rational Speech Act (RSA) model provides a flexible framework to model pragmatic reasoning in computational terms. However, state-of-the-art RSA models are still fairly distant from modern machine learning techniques and present a number of limitations related to their interpretability and scalability. Here, we introduce a new RSA framework for metaphor understanding that addresses these limitations by providing an explicit formula - based on the mutually shared information between the speaker and the listener - for the estimation of the communicative goal and by learning the rationality parameter using gradient-based methods. The model was tested against 24 metaphors, not limited to the conventional $\textit{John-is-a-shark}$ type. Results suggest an overall strong positive correlation between the distributions generated by the model and the interpretations obtained from the human behavioral data, which increased when the intended meaning capitalized on properties that were inherent to the vehicle concept. Overall, findings suggest that metaphor processing is well captured by a typicality-based Bayesian model, even when more scalable and interpretable, opening up possible applications to other pragmatic phenomena and novel uses for increasing Large Language Models interpretability. Yet, results highlight that the more creative nuances of metaphorical meaning, not strictly encoded in the lexical concepts, are a challenging aspect for machines.