Abstract:Large Language Models (LLMs) are distinguished by their architecture, which dictates their parameter size and performance capabilities. Social scientists have increasingly adopted LLMs for text classification tasks, which are difficult to scale with human coders. While very large, closed-source models often deliver superior performance, their use presents significant risks. These include lack of transparency, potential exposure of sensitive data, challenges to replicability, and dependence on proprietary systems. Additionally, their high costs make them impractical for large-scale research projects. In contrast, open-source models, although available in various sizes, may underperform compared to commercial alternatives if used without further fine-tuning. However, open-source models offer distinct advantages: they can be run locally (ensuring data privacy), fine-tuned for specific tasks, shared within the research community, and integrated into reproducible workflows. This study demonstrates that small, fine-tuned open-source LLMs can achieve equal or superior performance to models such as ChatGPT-4. We further explore the relationship between training set size and fine-tuning efficacy in open-source models. Finally, we propose a hybrid workflow that leverages the strengths of both open and closed models, offering a balanced approach to performance, transparency, and reproducibility.
Abstract:This work introduces a new concept of functional areas called Mobility Functional Areas (MFAs), i.e., the geographic zones highly interconnected according to the analysis of mobile positioning data. The MFAs do not coincide necessarily with administrative borders as they are built observing natural human mobility and, therefore, they can be used to inform, in a bottom-up approach, local transportation, health and economic policies. After presenting the methodology behind the MFAs, this study focuses on the link between the COVID-19 pandemic and the MFAs in Austria. It emerges that the MFAs registered an average number of infections statistically larger than the areas in the rest of the country, suggesting the usefulness of the MFAs in the context of targeted re-escalation policy responses to this health crisis.
Abstract:This study analyzes the impact of the COVID-19 pandemic on the subjective well-being as measured through Twitter data indicators for Japan and Italy. It turns out that, overall, the subjective well-being dropped by 11.7% for Italy and 8.3% for Japan in the first nine months of 2020 compared to the last two months of 2019 and even more compared to the historical mean of the indexes. Through a data science approach we try to identify the possible causes of this drop down by considering several explanatory variables including, climate and air quality data, number of COVID-19 cases and deaths, Facebook Covid and flu symptoms global survey, Google Trends data and coronavirus-related searches, Google mobility data, policy intervention measures, economic variables and their Google Trends proxies, as well as health and stress proxy variables based on big data. We show that a simple static regression model is not able to capture the complexity of well-being and therefore we propose a dynamic elastic net approach to show how different group of factors may impact the well-being in different periods, even over a short time length, and showing further country-specific aspects. Finally, a structural equation modeling analysis tries to address the causal relationships among the COVID-19 factors and subjective well-being showing that, overall, prolonged mobility restrictions,flu and Covid-like symptoms, economic uncertainty, social distancing and news about the pandemic have negative effects on the subjective well-being.
Abstract:The effects of the so-called "refugee crisis" of 2015-16 continue to dominate much of the European political agenda. Migration flows were sudden and unexpected, exposing significant shortcomings in the field of migration forecasting and leaving governments and NGOs unprepared. Migration is a complex system typified by episodic variation, underpinned by causal factors that are interacting, highly context dependent and short-lived. Correspondingly, migration nowcasts rely on scattered low-quality data and much-needed forecasts are local and inconsistent. Here we describe a data-driven adaptive system for forecasting asylum applications in the European Union (EU), built on machine learning algorithms that combine administrative data with non-traditional data sources at scale. We exploit three tiers of data: geolocated events and internet searches in countries of origin, detections at the EU external border, and asylum recognition rates in the EU, to effectively forecast individual asylum-migration flows up to four weeks ahead with high accuracy. Uniquely our approach a) models individual country-to-country migration flows; b) detects migration drivers early onset; c) anticipates lagged effects; d) estimates the effect of individual drivers; and e) describes how patterns of drivers shift over time. This is, to our knowledge, the first comprehensive system for forecasting asylum applications based on an unsupervised algorithm and data at scale. Importantly, this approach can be extended to forecast other migration social-economic indicators.
Abstract:This study analyzes the usage of Japanese gendered language on Twitter. Starting from a collection of 408 million Japanese tweets from 2015 till 2019 and an additional sample of 2355 manually classified Twitter accounts timelines into gender and categories (politicians, musicians, etc). A large scale textual analysis is performed on this corpus to identify and examine sentence-final particles (SFPs) and first-person pronouns appearing in the texts. It turns out that gendered language is in fact used also on Twitter, in about 6% of the tweets, and that the prescriptive classification into "male" and "female" language does not always meet the expectations, with remarkable exceptions. Further, SFPs and pronouns show increasing or decreasing trends, indicating an evolution of the language used on Twitter.
Abstract:In this paper a new dissimilarity measure to identify groups of assets dynamics is proposed. The underlying generating process is assumed to be a diffusion process solution of stochastic differential equations and observed at discrete time. The mesh of observations is not required to shrink to zero. As distance between two observed paths, the quadratic distance of the corresponding estimated Markov operators is considered. Analysis of both synthetic data and real financial data from NYSE/NASDAQ stocks, give evidence that this distance seems capable to catch differences in both the drift and diffusion coefficients contrary to other commonly used metrics.