Normative modeling learns a healthy reference distribution and quantifies subject-specific deviations to capture heterogeneous disease effects. In Alzheimers disease (AD), multimodal neuroimaging offers complementary signals but VAE-based normative models often (i) fit the healthy reference distribution imperfectly, inflating false positives, and (ii) use posterior aggregation (e.g., PoE/MoE) that can yield weak multimodal fusion in the shared latent space. We propose mmSIVAE, a multimodal soft-introspective variational autoencoder combined with Mixture-of-Product-of-Experts (MOPOE) aggregation to improve reference fidelity and multimodal integration. We compute deviation scores in latent space and feature space as distances from the learned healthy distributions, and map statistically significant latent deviations to regional abnormalities for interpretability. On ADNI MRI regional volumes and amyloid PET SUVR, mmSIVAE improves reconstruction on held-out controls and produces more discriminative deviation scores for outlier detection than VAE baselines, with higher likelihood ratios and clearer separation between control and AD-spectrum cohorts. Deviation maps highlight region-level patterns aligned with established AD-related changes. More broadly, our results highlight the importance of training objectives that prioritize reference-distribution fidelity and robust multimodal posterior aggregation for normative modeling, with implications for deviation-based analysis across multimodal clinical data.