This paper aims to develop a global perspective of the complexity of the relationship between the standardised per-capita growth rate of Covid-19 cases, deaths, and the OxCGRT Covid-19 Stringency Index, a measure describing a country's stringency of lockdown policies. To achieve our goal, we use a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. We identify that the Covid-19 dataset may project onto two low-dimensional manifolds without significant information loss. The low dimensionality suggests strong dependency among the standardised growth rates of cases and deaths per capita and the OxCGRT Covid-19 Stringency Index for a country over 2020-2021. Given the low dimensional structure, it may be feasible to model observable Covid-19 dynamics with few parameters. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. Moreover, we highlight that high-income countries are more likely to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from Covid-19. Finally, we temporally stratify the dataset to examine the intrinsic dimension at a more granular level throughout the Covid-19 pandemic.
For stochastic models with intractable likelihood functions, approximate Bayesian computation offers a way of approximating the true posterior through repeated comparisons of observations with simulated model outputs in terms of a small set of summary statistics. These statistics need to retain the information that is relevant for constraining the parameters but cancel out the noise. They can thus be seen as thermodynamic state variables, for general stochastic models. For many scientific applications, we need strictly more summary statistics than model parameters to reach a satisfactory approximation of the posterior. Therefore, we propose to use the inner dimension of deep neural network based Autoencoders as summary statistics. To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information on the noise that has been used to generate the training data. We validate the approach empirically on two types of stochastic models.
Cardio/cerebrovascular diseases (CVD) have become one of the major health issue in our societies. But recent studies show that the present clinical tests to detect CVD are ineffectual as they do not consider different stages of platelet activation or the molecular dynamics involved in platelet interactions and are incapable to consider inter-individual variability. Here we propose a stochastic platelet deposition model and an inferential scheme for uncertainty quantification of these parameters using Approximate Bayesian Computation and distance learning. Finally we show that our methodology can learn biologically meaningful parameters, which are the specific dysfunctioning parameters in each type of patients, from data collected from healthy volunteers and patients. This work opens up an unprecedented opportunity of personalized pathological test for CVD detection and medical treatment. Also our proposed methodology can be used to other fields of science where we would need machine learning tools to be interpretable.
It is well known that a small number of variables is often sufficient to effectively describe high-dimensional data. This number is called the intrinsic dimension (ID) of the data. What is not so commonly known is that the ID can vary within the same dataset. This fact has been highlighted in technical discussions, but seldom exploited to gain practical insight in the data structure. Here we develop a simple and robust approach to cluster regions with the same local ID in a given data landscape. Surprisingly, we find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded vs unfolded configurations in a protein molecular dynamics trajectory, active vs non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. Our results show that a simple topological feature, the local ID, is sufficient to uncover a rich structure in high-dimensional data landscapes.
A major line of contemporary research on complex networks is based on the development of statistical models that specify the local motifs associated with macro-structural properties observed in actual networks. This statistical approach becomes increasingly problematic as network size increases. In the context of current research on efficient estimation of models for large network data sets, we propose a fast algorithm for maximum likelihood estimation (MLE) that afords a signifcant increase in the size of networks amenable to direct empirical analysis. The algorithm we propose in this paper relies on properties of Markov chains at equilibrium, and for this reason it is called equilibrium expectation (EE). We demonstrate the performance of the EE algorithm in the context of exponential random graphmodels (ERGMs) a family of statistical models commonly used in empirical research based on network data observed at a single period in time. Thus far, the lack of efcient computational strategies has limited the empirical scope of ERGMs to relatively small networks with a few thousand nodes. The approach we propose allows a dramatic increase in the size of networks that may be analyzed using ERGMs. This is illustrated in an analysis of several biological networks and one social network with 104,103 nodes
Infectious diseases are studied to understand their spreading mechanisms, to evaluate control strategies and to predict the risk and course of future outbreaks. Because people only interact with a small number of individuals, and because the structure of these interactions matters for spreading processes, the pairwise relationships between individuals in a population can be usefully represented by a network. Although the underlying processes of transmission are different, the network approach can be used to study the spread of pathogens in a contact network or the spread of rumors in an online social network. We study simulated simple and complex epidemics on synthetic networks and on two empirical networks, a social / contact network in an Indian village and an online social network in the U.S. Our goal is to learn simultaneously about the spreading process parameters and the source node (first infected node) of the epidemic, given a fixed and known network structure, and observations about state of nodes at several points in time. Our inference scheme is based on approximate Bayesian computation (ABC), an inference technique for complex models with likelihood functions that are either expensive to evaluate or analytically intractable. ABC enables us to adopt a Bayesian approach to the problem despite the posterior distribution being very complex. Our method is agnostic about the topology of the network and the nature of the spreading process. It generally performs well and, somewhat counter-intuitively, the inference problem appears to be easier on more heterogeneous network topologies, which enhances its future applicability to real-world settings where few networks have homogeneous topologies.