This paper introduces a novel approach for efficiently distilling LLMs into smaller, application-specific models, significantly reducing operational costs and manual labor. Addressing the challenge of deploying computationally intensive LLMs in specific applications or edge devices, this technique utilizes LLMs' reasoning capabilities to generate labels and natural language rationales for unlabeled data. Our approach enhances both finetuning and distillation by employing a multi-task training framework where student models mimic these rationales alongside teacher predictions. Key contributions include the employment of zero-shot prompting to elicit teacher model rationales, reducing the necessity for handcrafted few-shot examples and lowering the overall token count required, which directly translates to cost savings given the pay-per-token billing model of major tech companies' LLM APIs. Additionally, the paper investigates the impact of explanation properties on distillation efficiency, demonstrating that minimal performance loss occurs even when rationale augmentation is not applied across the entire dataset, facilitating further reductions of tokens. This research marks a step toward the efficient training of task-specific models with minimal human intervention, offering substantial cost-savings while maintaining, or even enhancing, performance.
There are various applications, where companies need to decide to which individuals they should best allocate treatment. To support such decisions, uplift models are applied to predict treatment effects on an individual level. Based on the predicted treatment effects, individuals can be ranked and treatment allocation can be prioritized according to this ranking. An implicit assumption, which has not been doubted in the previous uplift modeling literature, is that this treatment prioritization approach tends to bring individuals with high treatment effects to the top and individuals with low treatment effects to the bottom of the ranking. In our research, we show that heteroskedastictity in the training data can cause a bias of the uplift model ranking: individuals with the highest treatment effects can get accumulated in large numbers at the bottom of the ranking. We explain theoretically how heteroskedasticity can bias the ranking of uplift models and show this process in a simulation and on real-world data. We argue that this problem of ranking bias due to heteroskedasticity might occur in many real-world applications and requires modification of the treatment prioritization to achieve an efficient treatment allocation.
This paper explores the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques in cryptocurrency price forecasting, specifically Bitcoin (BTC) and Ethereum (ETH). Focusing on news and social media data, primarily from Twitter and Reddit, we analyse the influence of public sentiment on cryptocurrency valuations using advanced deep learning NLP methods. Alongside conventional price regression, we treat cryptocurrency price forecasting as a classification problem. This includes both the prediction of price movements (up or down) and the identification of local extrema. We compare the performance of various ML models, both with and without NLP data integration. Our findings reveal that incorporating NLP data significantly enhances the forecasting performance of our models. We discover that pre-trained models, such as Twitter-RoBERTa and BART MNLI, are highly effective in capturing market sentiment, and that fine-tuning Large Language Models (LLMs) also yields substantial forecasting improvements. Notably, the BART MNLI zero-shot classification model shows considerable proficiency in extracting bullish and bearish signals from textual data. All of our models consistently generate profit across different validation scenarios, with no observed decline in profits or reduction in the impact of NLP data over time. The study highlights the potential of text analysis in improving financial forecasts and demonstrates the effectiveness of various NLP techniques in capturing nuanced market sentiment.
The increasing usage of new data sources and machine learning (ML) technology in credit modeling raises concerns with regards to potentially unfair decision-making that rely on protected characteristics (e.g., race, sex, age) or other socio-economic and demographic data. The authors demonstrate the impact of such algorithmic bias in the microfinance context. Difficulties in assessing credit are disproportionately experienced among vulnerable groups, however, very little is known about inequities in credit allocation between groups defined, not only by single, but by multiple and intersecting social categories. Drawing from the intersectionality paradigm, the study examines intersectional horizontal inequities in credit access by gender, age, marital status, single parent status and number of children. This paper utilizes data from the Spanish microfinance market as its context to demonstrate how pluralistic realities and intersectional identities can shape patterns of credit allocation when using automated decision-making systems. With ML technology being oblivious to societal good or bad, we find that a more thorough examination of intersectionality can enhance the algorithmic fairness lens to more authentically empower action for equitable outcomes and present a fairer path forward. We demonstrate that while on a high-level, fairness may exist superficially, unfairness can exacerbate at lower levels given combinatorial effects; in other words, the core fairness problem may be more complicated than current literature demonstrates. We find that in addition to legally protected characteristics, sensitive attributes such as single parent status and number of children can result in imbalanced harm. We discuss the implications of these findings for the financial services industry.
In response to growing FinTech competition and the need for improved operational efficiency, this research focuses on understanding the potential of advanced document analytics, particularly using multimodal models, in banking processes. We perform a comprehensive analysis of the diverse banking document landscape, highlighting the opportunities for efficiency gains through automation and advanced analytics techniques in the customer business. Building on the rapidly evolving field of natural language processing (NLP), we illustrate the potential of models such as LayoutXLM, a cross-lingual, multimodal, pre-trained model, for analyzing diverse documents in the banking sector. This model performs a text token classification on German company register extracts with an overall F1 score performance of around 80\%. Our empirical evidence confirms the critical role of layout information in improving model performance and further underscores the benefits of integrating image information. Interestingly, our study shows that over 75% F1 score can be achieved with only 30% of the training data, demonstrating the efficiency of LayoutXLM. Through addressing state-of-the-art document analysis frameworks, our study aims to enhance process efficiency and demonstrate the real-world applicability and benefits of multimodal models within banking.
We propose a novel method for predicting time-to-event in the presence of cure fractions based on flexible survivals models integrated into a deep neural network framework. Our approach allows for non-linear relationships and high-dimensional interactions between covariates and survival and is suitable for large-scale applications. Furthermore, we allow the method to incorporate an identified predictor formed of an additive decomposition of interpretable linear and non-linear effects and add an orthogonalization layer to capture potential higher dimensional interactions. We demonstrate the usefulness and computational efficiency of our method via simulations and apply it to a large portfolio of US mortgage loans. Here, we find not only a better predictive performance of our framework but also a more realistic picture of covariate effects.
There has been intensive research regarding machine learning models for predicting bankruptcy in recent years. However, the lack of interpretability limits their growth and practical implementation. This study proposes a data-driven explainable case-based reasoning (CBR) system for bankruptcy prediction. Empirical results from a comparative study show that the proposed approach performs superior to existing, alternative CBR systems and is competitive with state-of-the-art machine learning models. We also demonstrate that the asymmetrical feature similarity comparison mechanism in the proposed CBR system can effectively capture the asymmetrically distributed nature of financial attributes, such as a few companies controlling more cash than the majority, hence improving both the accuracy and explainability of predictions. In addition, we delicately examine the explainability of the CBR system in the decision-making process of bankruptcy prediction. While much research suggests a trade-off between improving prediction accuracy and explainability, our findings show a prospective research avenue in which an explainable model that thoroughly incorporates data attributes by design can reconcile the dilemma.
Estimating treatment effects is one of the most challenging and important tasks of data analysts. Traditional statistical methods aim to estimate average treatment effects over a population. While being highly useful, such average treatment effects do not help to decide which individuals profit most by the treatment. This is where uplift modeling becomes important. Uplift models help to select the right individuals for treatment, to maximize the overall treatment effect (uplift). A challenging problem in uplift modeling is to evaluate the models. Previous literature suggests methods like the Qini curve and the transformed outcome mean squared error. However, these metrics suffer from variance: Their evaluations are strongly affected by random noise in the data, which makes these evaluations to a certain degree arbitrary. In this paper, we analyze the variance of the uplift evaluation metrics, on randomized controlled trial data, in a sound statistical manner. We propose certain outcome adjustment methods, for which we prove theoretically and empirically, that they reduce the variance of the uplift evaluation metrics. Our statistical analysis and the proposed outcome adjustment methods are a step towards a better evaluation practice in uplift modeling.
Generative models synthesize image data with great success regarding sampling quality, diversity and feature disentanglement. Generative models for time series lack these benefits due to a missing representation, which captures temporal dynamics and allows inversion for sampling. The paper proposes the intertemporal return plot (IRP) representation to facilitate the use of image-based generative adversarial networks for time series generation. The representation proves effective in capturing time series characteristics and, compared to alternative representations, benefits from invertibility and scale-invariance. Empirical benchmarks confirm these features and demonstrate that the IRP enables an off-the-shelf Wasserstein GAN with gradient penalty to sample realistic time series, which outperform a specialized RNN-based GAN, while simultaneously reducing model complexity.
Recurrent neural networks (RNNs) like long short-term memory networks (LSTMs) and gated recurrent units (GRUs) are a popular choice for modeling sequential data. Their gating mechanism permits weighting previous history encoded in a hidden state with new information from incoming observations. In many applications, such as medical records, observations times are irregular and carry important information. However, LSTMs and GRUs assume constant time intervals between observations. To address this challenge, we propose continuous recurrent units (CRUs) -a neural architecture that can naturally handle irregular time intervals between observations. The gating mechanism of the CRU employs the continuous formulation of a Kalman filter and alternates between (1) continuous latent state propagation according to a linear stochastic differential equation (SDE) and (2) latent state updates whenever a new observation comes in. In an empirical study, we show that the CRU can better interpolate irregular time series than neural ordinary differential equation (neural ODE)-based models. We also show that our model can infer dynamics from im-ages and that the Kalman gain efficiently singles out candidates for valuable state updates from noisy observations.