Abstract:Small longitudinal clinical cohorts, common in maternal health, rare diseases, and early-phase trials, limit computational modeling: too few patients to train reliable models, yet too costly and slow to expand through additional enrollment. We present multiplicity-weighted Stochastic Attention (SA), a generative framework based on modern Hopfield network theory that addresses this gap. SA embeds real patient profiles as memory patterns in a continuous energy landscape and generates novel synthetic patients via Langevin dynamics that interpolate between stored patterns while preserving the geometry of the original cohort. Per-pattern multiplicity weights enable targeted amplification of rare clinical subgroups at inference time without retraining. We applied SA to a longitudinal coagulation dataset from 23 pregnant patients spanning 72 biochemical features across 3 visits (pre-pregnancy baseline, first trimester, and third trimester), including rare subgroups such as polycystic ovary syndrome and preeclampsia. Synthetic patients generated by SA were statistically, structurally, and mechanistically indistinguishable from their real counterparts across multiple independent validation tests, including an ordinary differential equation model of the coagulation cascade. A downstream utility test further showed that a mechanistic model calibrated entirely on synthetic patients predicted held-out real patient outcomes as well as one calibrated on real data. These results demonstrate that SA can produce clinically useful synthetic cohorts from very small longitudinal datasets, enabling data-augmented modeling in small-cohort settings.
Abstract:Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler's attention logits, continuously shifts generation from the full family toward a user-specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler's internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue-level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega-conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.
Abstract:Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed-form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per-position amino acid frequencies.
Abstract:Generating synthetic financial time series that preserve statistical properties of real market data is essential for stress testing, risk model validation, and scenario design. Existing approaches, from parametric models to deep generative networks, struggle to simultaneously reproduce heavy-tailed distributions, negligible linear autocorrelation, and persistent volatility clustering. We propose a hybrid hidden Markov framework that discretizes continuous excess growth rates into Laplace quantile-defined market states and augments regime switching with a Poisson-driven jump-duration mechanism to enforce realistic tail-state dwell times. Parameters are estimated by direct transition counting, bypassing the Baum-Welch EM algorithm. Synthetic data quality is evaluated using Kolmogorov-Smirnov and Anderson-Darling pass rates for distributional fidelity, and ACF mean absolute error for temporal structure. Applied to ten years of SPY data across 1,000 simulated paths, the framework achieves KS and AD pass rates exceeding 97% and 91% in-sample and 94% out-of-sample (calendar year 2025), partially reproducing the ARCH effect that standard regime-switching models miss. No single model dominates all quality dimensions: GARCH(1,1) reproduces volatility clustering more accurately but fails distributional tests (5.5% KS pass rate), while the standard HMM without jumps achieves higher distributional fidelity but cannot generate persistent high-volatility regimes. The proposed framework offers the best joint quality profile across distributional, temporal, and tail-coverage metrics. A Single-Index Model extension propagates the SPY factor path to a 424-asset universe, enabling scalable correlated synthetic path generation while preserving cross-sectional correlation structure.