Abstract: Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of ``accidentals,'' i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.
Abstract: Quantifying prediction uncertainty has become a key research topic in scientific and business problems in recent years. In the insurance industry (\cite{parodi2023pricing}), assessing the range of possible claim costs for individual drivers improves premium pricing accuracy. It also enables insurers to manage risk more effectively by accounting for uncertainty in accident likelihood and severity. In the presence of covariates, a variety of regression-type models are used for modeling insurance claims, ranging from relatively simple generalized linear models (GLMs) to regularized GLMs to gradient boosting models (GBMs). Conformal predictive inference has arisen as a popular distribution-free approach for quantifying predictive uncertainty under the relatively weak assumption of exchangeability, and has been well studied in the classical linear regression setting. In this work, we propose new non-conformity measures for GLMs and for GBMs with GLM-type loss. Using regularized Tweedie GLM regression and LightGBM with Tweedie loss, we demonstrate conformal prediction performance with these non-conformity measures on insurance claims data. Our simulation results favor the use of locally weighted Pearson residuals for LightGBM over the other methods considered, as the resulting intervals maintain the nominal coverage with the smallest average width.
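To make the idea of a Pearson-residual non-conformity measure concrete, the following is a minimal sketch of split conformal prediction with a Pearson-type score under a Tweedie variance function $V(\mu) = \mu^p$. It is an illustration under stated assumptions, not necessarily the exact measure proposed in the work: the function names, the variance power `power`, and the use of precomputed fitted means (`mu_cal`, `mu_new`) in place of a fitted Tweedie GLM or LightGBM model are all hypothetical choices made for this example.

```python
import numpy as np

def pearson_score(y, mu, power=1.5):
    """Absolute Pearson residual |y - mu| / sqrt(V(mu)) with V(mu) = mu**power.

    `power` is an assumed Tweedie variance power; mu must be strictly positive.
    """
    return np.abs(y - mu) / np.power(mu, power / 2.0)

def split_conformal_interval(y_cal, mu_cal, mu_new, alpha=0.1, power=1.5):
    """Split conformal intervals from calibration scores.

    y_cal, mu_cal: observed responses and fitted means on a held-out
                   calibration set (not used to fit the model).
    mu_new:        fitted means for the new points.
    """
    scores = pearson_score(y_cal, mu_cal, power)
    n = scores.size
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else np.sort(scores)[k - 1]
    # Invert the score: the half-width scales with sqrt(V(mu_new)).
    half_width = q * np.power(mu_new, power / 2.0)
    lower = np.maximum(mu_new - half_width, 0.0)  # claim costs are non-negative
    upper = mu_new + half_width
    return lower, upper
```

Because the score is divided by $\sqrt{V(\mu)}$, the resulting intervals widen for policies with larger predicted claim cost, which is the motivation for residual-scaled scores over plain absolute residuals. In practice `mu_cal` and `mu_new` would come from a model fitted on a separate training split.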
Abstract: The Tweedie exponential dispersion family is a popular choice for modeling insurance losses that consist of zero-inflated semicontinuous data. In such data, it is often important to obtain credible inference for the most important features that describe the endogenous variables. Post-selection inference is the standard procedure in statistics for obtaining confidence intervals of model parameters after performing a feature selection procedure. For a linear model, the lasso estimate often has non-negligible estimation bias for large coefficients corresponding to exogenous variables. To obtain valid inference on those coefficients, it is necessary to correct the bias of the lasso estimate. Traditional statistical methods, such as hypothesis testing or standard confidence interval construction, can lead to incorrect conclusions after selection, as they are generally too optimistic. Here we discuss a few methodologies for constructing confidence intervals of the coefficients after feature selection in the Generalized Linear Model (GLM) family, with application to insurance data.
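The lasso bias for large coefficients can be seen in the simplest special case. For a linear model with an orthonormal design ($X^\top X = I$) and objective $\tfrac{1}{2}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$, the lasso estimate is the soft-thresholded least-squares estimate, so every surviving coefficient is shrunk toward zero by exactly $\lambda$. The sketch below illustrates only this special case; the input values are made up for the example, and it does not implement any of the post-selection methods discussed in the work.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical least-squares estimates under an orthonormal design.
ols = np.array([5.0, 0.3, -2.0])

# Lasso solution in this special case: small coefficients are set to zero,
# large ones are kept but shrunk in magnitude by lam.
lasso = soft_threshold(ols, lam=1.0)

# Shrinkage (bias relative to least squares) on the selected coefficients.
selected = lasso != 0
shrinkage = np.abs(ols[selected]) - np.abs(lasso[selected])
```

Here every selected coefficient is shrunk by the same amount `lam`, regardless of its size, which is why a bias correction is needed before building confidence intervals around lasso estimates.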