Suchi Saria

Computer Science, Statistics, and Health Policy, Johns Hopkins University, Baltimore, MD, USA; ML, AI and Healthcare Lab; Bayesian Health, New York, NY, USA

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

May 13, 2026

Hsing-Huan Chung, Shijun Li, Yoav Wald, Xing Han, Suchi Saria, Joydeep Ghosh

Abstract:Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example, patients' electronic health records (EHR) include irregular lab measurements and clinical notes. The irregular timing and channel patterns of observations carry predictive signal alongside the numerical values and textual content. LLMs are natural candidates for processing such heterogeneous data, given their extensive pretrained knowledge spanning textual and numerical domains. We introduce MILM (Multimodal Irregular time series Language Model), which represents MITS as time-ordered triplets in Extensible Markup Language (XML) format and fine-tunes an LLM through a two-stage strategy for MITS classification. The first stage trains on value-redacted MITS to predict from sampling patterns alone, and the second stage trains on full MITS to jointly model sampling patterns and observed values. Our two-stage model (MILM-2S) and its single-stage counterpart (MILM-Direct) achieve the best and second-best average performance on multiple EHR datasets. Further value redaction evaluations confirm that sampling patterns carry predictive signal and that MILM-2S learns to exploit them. In the value pending evaluation we introduce, where some values are unavailable at prediction time, MILM-2S outperforms MILM-Direct by a larger margin compared to standard evaluation. For MILM-2S, preserving the time and channel of value-pending observations as additional sampling information further improves in-hospital mortality prediction.
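The triplet representation and value redaction described in this abstract can be illustrated with a minimal sketch. The tag names (`series`, `obs`, `time`, `channel`, `value`), the `REDACTED` placeholder, and the function name are assumptions made for illustration; the abstract does not specify the actual schema or prompt format used by MILM.

```python
import xml.etree.ElementTree as ET

def to_xml(observations, redact_values=False):
    """Serialize (time, channel, value) triplets as time-ordered XML.

    Hypothetical schema: tag names and the REDACTED placeholder are
    illustrative assumptions, not the format used in the paper.
    """
    root = ET.Element("series")
    # Order observations by timestamp so the text reads chronologically.
    for time, channel, value in sorted(observations, key=lambda o: o[0]):
        obs = ET.SubElement(root, "obs")
        ET.SubElement(obs, "time").text = str(time)
        ET.SubElement(obs, "channel").text = channel
        # With redaction on, only the sampling pattern (time, channel) remains.
        ET.SubElement(obs, "value").text = "REDACTED" if redact_values else str(value)
    return ET.tostring(root, encoding="unicode")

records = [(2.5, "note", "patient stable"), (0.5, "heart_rate", 88)]
print(to_xml(records))                      # full series
print(to_xml(records, redact_values=True))  # sampling pattern only
```

In this sketch the redacted output keeps the time and channel of every observation, which is the kind of input the first training stage is described as using.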

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

May 10, 2026

Xing Han, Shravan Chaudhari, Tanvi Ranade, Rama Chellappa, Suchi Saria

Abstract:Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, in which tasks are co-available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework supports training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks, where it demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

* 37 pages, 25 figures, 6 tables

Conformal Policy Control

Mar 02, 2026

Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton

Abstract:An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.

Improving Coverage in Combined Prediction Sets with Weighted p-values

May 17, 2025

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

Abstract:Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets, assuming exchangeability. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-\alpha$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2\alpha$ worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2\alpha$ guarantee of the combined models and the $1-\alpha$ guarantee of an individual model depending on the distribution of weights. We extend our framework to data-dependent weights, and we derive a general procedure for data-dependent weight aggregation that maintains finite-sample validity. We demonstrate the effectiveness of our methods through experiments on synthetic and real data in the mixture-of-experts setting, and we show that aggregation with data-dependent weights provides a form of adaptive coverage.
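As a rough illustration of weighted aggregation, the sketch below averages per-model conformal p-values with user-supplied weights and keeps every label whose weighted p-value exceeds $\alpha$. This is one plausible reading of "weighted p-values", not the paper's procedure: the function names, the fixed (data-independent) weights, and the thresholding rule are all assumptions, and the sketch makes no claim about the exact coverage bound it attains.

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Standard split-conformal p-value from calibration nonconformity scores."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    return (1.0 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1.0)

def weighted_prediction_set(cal_scores_per_model, test_scores_per_model, weights, alpha):
    """Keep labels whose weighted-average p-value exceeds alpha.

    test_scores_per_model[k][y] is the nonconformity score of candidate
    label y under model k. Weights are normalized to sum to one.
    Illustrative assumption: weights are fixed in advance.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    n_labels = len(test_scores_per_model[0])
    kept = []
    for y in range(n_labels):
        p = np.array([
            conformal_p_value(cal, test[y])
            for cal, test in zip(cal_scores_per_model, test_scores_per_model)
        ])
        if float(weights @ p) > alpha:
            kept.append(y)
    return kept

cal = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
test = [[0.05, 0.9], [0.05, 0.05]]  # the two models disagree on label 1
print(weighted_prediction_set(cal, test, weights=[1, 0], alpha=0.25))
print(weighted_prediction_set(cal, test, weights=[0, 1], alpha=0.25))
```

Putting all weight on one model reduces the rule to that model's own conformal set, which is the individual-model end of the interpolation the abstract describes.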

WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales

May 12, 2025

Drew Prinster, Xing Han, Anqi Liu, Suchi Saria

Abstract:Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection, especially the tools of conformal test martingales (CTMs) and anytime-valid inference, offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or "alarm criteria" (such as data shifts that violate certain exchangeability assumptions), do not allow for online adaptation in response to shifts, and/or do not enable root-cause analysis of any degradation. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution) while quickly detecting and diagnosing more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
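For readers unfamiliar with conformal test martingales, the following is a minimal sketch of the standard, unweighted CTM that the abstract builds on, not the weighted WCTM the paper proposes. It assumes smoothed conformal p-values and a power betting function $f(p) = \epsilon p^{\epsilon - 1}$; the function name, the choice of $\epsilon$, and the clipping constant are illustrative assumptions.

```python
import numpy as np

def conformal_test_martingale(scores, epsilon=0.5, seed=0):
    """Path of a power conformal test martingale over a stream of scores.

    Plain (unweighted) CTM sketch: under exchangeability the smoothed
    p-values are uniform, so the product of bets is a martingale and
    large values are evidence of a distribution change.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    log_m = 0.0
    path = []
    for t in range(len(scores)):
        past, current = scores[:t], scores[t]
        greater = np.sum(past > current)
        ties = np.sum(past == current) + 1  # the current point ties with itself
        # Smoothed conformal p-value with random tie-breaking.
        p = (greater + rng.uniform() * ties) / (t + 1)
        p = max(p, 1e-12)  # numerical guard against log(0)
        # Power betting function: integrates to 1 over [0, 1].
        log_m += np.log(epsilon) + (epsilon - 1.0) * np.log(p)
        path.append(log_m)
    return np.exp(np.array(path))

# A steadily drifting stream is far from exchangeable, so the martingale grows.
path = conformal_test_martingale(np.arange(50.0))
print(path[-1] > 100)
```

By Ville's inequality, raising an alarm when the martingale first reaches $1/\delta$ keeps the false-alarm probability at most $\delta$ under exchangeability.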

* To be published in The International Conference on Machine Learning (ICML), 2025

WATCH: Weighted Adaptive Testing for Changepoint Hypotheses via Weighted-Conformal Martingales

May 07, 2025

Drew Prinster, Xing Han, Anqi Liu, Suchi Saria

Abstract:Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection, especially the tools of conformal test martingales (CTMs) and anytime-valid inference, offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or "alarm criteria," such as data shifts that violate certain exchangeability assumptions, or do not allow for online adaptation in response to shifts. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that accommodate online adaptation to mild covariate shifts (in the marginal input distribution) while raising alarms in response to more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.

* To be published in The International Conference on Machine Learning (ICML), 2025

Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning

Apr 10, 2025

Hsing-Huan Chung, Shravan Chaudhari, Xing Han, Yoav Wald, Suchi Saria, Joydeep Ghosh

Abstract:Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal relationships between edge events. In this paper, we study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. These results can be readily used to positively impact the design choices of a wide variety of dynamic graph learning architectures. The experimental code is available at: https://github.com/hsinghuan/dg-linear-time.git.
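The contrast between the two encoders can be sketched in a few lines of NumPy. The cosine form $\cos(\omega t + \phi)$ for the sinusoidal encoder and the affine form $w t + b$ for the linear encoder are assumptions made for illustration, as are the function and parameter names; the linked repository is the reference for the actual implementation.

```python
import numpy as np

def sinusoidal_encode(t, omega, phi):
    """Sinusoidal time encoding: one cosine feature per frequency (assumed form)."""
    return np.cos(np.outer(t, omega) + phi)

def linear_encode(t, w, b):
    """Linear time encoding: an affine map of the timestamp (assumed form)."""
    return np.outer(t, w) + b

# Linear encodings: the difference of two encodings is w * (t2 - t1),
# so the time span is recoverable by a linear operation.
t = np.array([1.0, 4.0])
w, b = np.array([0.5, -2.0]), np.array([0.1, 0.3])
enc = linear_encode(t, w, b)
print((enc[1] - enc[0]) / w)  # each entry equals the time span 4.0 - 1.0

# Single-frequency sinusoidal encoding: timestamps one period apart collide,
# a simple case of the information loss that periodic functions can cause.
omega, phi = np.array([2.0]), np.array([0.0])
s = sinusoidal_encode(np.array([1.0, 1.0 + np.pi]), omega, phi)
print(np.allclose(s[0], s[1]))
```

The collision shown here relies on using a single frequency; with many frequencies the effect is less stark, which is why this is only a simplified illustration of the abstract's argument.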

On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

Oct 03, 2024

Huy Nguyen, Xing Han, Carl William Harris, Suchi Saria, Nhat Ho

Abstract:With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our investigation highlights the advantages of using varied gating functions, moving beyond softmax gating within HMoE frameworks. We theoretically demonstrate that applying tailored gating functions to each expert group allows HMoE to achieve robust results, even when optimal gating functions are applied only at select hierarchical levels. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements.

* 58 pages

Conformal Validity Guarantees Exist for Any Data Distribution

May 10, 2024

Drew Prinster, Samuel Stanton, Anqi Liu, Suchi Saria

Abstract:As machine learning (ML) gains widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when ML systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction has emerged as a promising approach to uncertainty and risk quantification, but existing variants either fail to accommodate sequences of data-dependent shifts, or do not fully exploit the fact that agent-induced shift is under our control. In this work we prove that conformal prediction can theoretically be extended to any joint data distribution, not just exchangeable or quasi-exchangeable ones, although it is exceedingly impractical to compute in the most general case. For practical applications, we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.

* ICML 2024. Code available at https://github.com/drewprinster/conformal-mfcs

FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion

Feb 05, 2024

Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, Suchi Saria

Abstract:As machine learning models in critical fields increasingly grapple with multimodal data, they face the dual challenges of handling a wide array of modalities, often incomplete due to missing elements, and the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce "FuseMoE", a mixture-of-experts framework that incorporates an innovative gating function. Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our unique gating function contributes to enhanced convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE in the real world is validated by a challenging set of clinical risk prediction tasks.

* 35 pages, 8 tables, 5 figures
