Marzyeh Ghassemi

Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Aug 16, 2023
Xuhai Xu, Bingshen Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang

Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant research gap in understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for the mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%, and further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study of LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for enhancing LLMs' capability on mental health tasks. At the same time, we emphasize important limitations that must be addressed before these models are deployable in real-world mental health settings, such as known racial and gender biases, and we highlight the significant ethical risks accompanying this line of research.
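
To make the prompting setup concrete, here is a minimal sketch of zero-shot prompting for a binary mental-health label. The prompt template, label parsing, and the google/flan-t5-base checkpoint (standing in for the fine-tuned Mental-FLAN-T5, which is not reproduced here) are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of zero-shot prompting for a binary mental-health label.
# Assumptions: the prompt wording and label parsing are illustrative, and
# "google/flan-t5-base" is a placeholder checkpoint, not Mental-FLAN-T5.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def predict_depression_risk(post: str) -> str:
    prompt = (
        "Read the following social media post and answer with 'yes' or 'no': "
        "does the author show signs of depression?\n\n"
        f"Post: {post}\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=3)[0]["generated_text"]
    return "yes" if "yes" in out.lower() else "no"

print(predict_depression_risk(
    "I haven't slept or eaten properly in weeks and nothing feels worth doing."))
```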

Deep Metric Learning for the Hemodynamics Inference with Electrocardiogram Signals

Aug 09, 2023
Hyewon Jeong, Collin M. Stultz, Marzyeh Ghassemi

Heart failure is a debilitating condition that affects millions of people worldwide and has a significant impact on their quality of life and mortality rates. An objective assessment of cardiac pressures remains an important method for diagnosis and treatment prognostication in patients with heart failure. Although cardiac catheterization is the gold standard for estimating central hemodynamic pressures, it is an invasive procedure that carries inherent risks, making it potentially dangerous for some patients. Approaches that leverage non-invasive signals - such as the electrocardiogram (ECG) - promise to make the routine estimation of cardiac pressures feasible in both inpatient and outpatient settings. Prior models trained to estimate intracardiac pressures (e.g., mean pulmonary capillary wedge pressure (mPCWP)) in a supervised fashion have shown good discriminatory ability but have been limited to labeled data from the heart failure cohort. To address this issue and build a robust representation, we apply deep metric learning (DML) and propose a novel self-supervised DML with distance-based mining that improves the performance of a model with limited labels. We use a dataset that contains over 5.4 million ECGs without concomitant central pressure labels to pre-train a self-supervised DML model, which showed improved classification of elevated mPCWP compared to self-supervised contrastive baselines. Additionally, the supervised DML model trained on ECGs with access to 8,172 mPCWP labels demonstrated significantly better performance on the mPCWP regression task compared to the supervised baseline. Moreover, our data suggest that DML yields models that are performant across patient subgroups, even when some subgroups are under-represented in the dataset. Our code is available at https://github.com/mandiehyewon/ssldml
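
As an illustration of the distance-based mining idea, the sketch below computes a triplet-style metric-learning loss in which each anchor's hardest negative is picked by distance from a candidate pool. The encoder, batch construction, and margin are assumptions; the paper's self-supervised mining strategy may differ in detail.

```python
# Minimal sketch of distance-based mining for deep metric learning on ECG
# embeddings. Assumptions: random tensors stand in for encoder outputs, and
# the margin and pool construction are illustrative.
import torch
import torch.nn.functional as F

def triplet_loss_with_distance_mining(anchors, positives, candidates, margin=0.5):
    """anchors/positives: (B, D) embeddings of two views of the same ECG;
    candidates: (N, D) embedding pool from which negatives are mined."""
    pos_dist = F.pairwise_distance(anchors, positives)   # (B,) anchor-positive distances
    d = torch.cdist(anchors, candidates)                 # (B, N) anchor-candidate distances
    hard_neg_dist, _ = d.min(dim=1)                      # hardest (closest) negative per anchor
    return F.relu(pos_dist - hard_neg_dist + margin).mean()

B, N, D = 8, 64, 128
loss = triplet_loss_with_distance_mining(
    torch.randn(B, D), torch.randn(B, D), torch.randn(N, D))
print(loss.item())
```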

VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception

Aug 03, 2023
Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O-Kil Kwon, Edward Choi

AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Given that most large-scale deep learning models act as black boxes and cannot be manually controlled, analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification, a fundamental task in machine perception. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios that may arise in the real world and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image, and is further divided into eight categories. All samples have a gold human perception label; even the labels of Uncertain (severely blurry) samples were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data are available at https://github.com/jiyounglee-0523/VisAlign.
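
For intuition on how an abstention method interacts with a Must-Act/Must-Abstain split, here is a minimal sketch of the simplest confidence-thresholding baseline. The threshold and scoring rule are illustrative assumptions, not the specific abstention methods or reliability metric evaluated on VisAlign.

```python
# Minimal sketch of confidence-thresholded abstention for image classification.
# Assumptions: the threshold value is arbitrary and the softmax vectors are toy inputs.
import numpy as np

def predict_or_abstain(probs: np.ndarray, threshold: float = 0.7):
    """probs: (num_classes,) softmax output for one image."""
    if probs.max() < threshold:
        return "abstain"          # not confident enough -> Must-Abstain behavior
    return int(probs.argmax())    # confident -> Must-Act behavior

print(predict_or_abstain(np.array([0.05, 0.90, 0.05])))  # confident -> class 1
print(predict_or_abstain(np.array([0.40, 0.35, 0.25])))  # uncertain -> "abstain"
```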

Evaluating the Impact of Social Determinants on Health Prediction

May 22, 2023
Ming Ying Yang, Gloria Hyunjung Kwak, Tom Pollard, Leo Anthony Celi, Marzyeh Ghassemi

Social determinants of health (SDOH) -- the conditions in which people live, grow, and age -- play a crucial role in a person's health and well-being. There is a large, compelling body of evidence in population health studies showing that a wide range of SDOH is strongly correlated with health outcomes. Yet, a majority of the risk prediction models based on electronic health records (EHR) do not incorporate a comprehensive set of SDOH features as they are often noisy or simply unavailable. Our work links a publicly available EHR database, MIMIC-IV, to well-documented SDOH features. We investigate the impact of such features on common EHR prediction tasks across different patient populations. We find that community-level SDOH features do not improve model performance for a general patient population, but can improve data-limited model fairness for specific subpopulations. We also demonstrate that SDOH features are vital for conducting thorough audits of algorithmic biases beyond protected attributes. We hope the new integrated EHR-SDOH database will enable studies on the relationship between community health and individual outcomes and provide new benchmarks to study algorithmic biases beyond race, gender, and age.
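
A minimal sketch of the kind of linkage involved: joining community-level SDOH features onto EHR admissions by a geographic key. The column names (zip_code, median_income, uninsured_rate), the join key, and the toy values are hypothetical; the released integration may use different identifiers and features.

```python
# Minimal sketch of an EHR-to-SDOH join on a geographic key.
# Assumptions: all column names and values below are illustrative placeholders.
import pandas as pd

admissions = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "zip_code": ["02139", "02139", "10027"],
    "in_hospital_mortality": [0, 1, 0],
})
sdoh = pd.DataFrame({
    "zip_code": ["02139", "10027"],
    "median_income": [85000, 62000],
    "uninsured_rate": [0.04, 0.09],
})

# Left join keeps every admission and attaches community-level features.
linked = admissions.merge(sdoh, on="zip_code", how="left")
print(linked)
```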

In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

May 18, 2023
Yuxin Xiao, Shulammite Lim, Tom Joseph Pollard, Marzyeh Ghassemi

Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved by commercial and open-source systems through the use of machine learning algorithms. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution: fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic groups fairly.
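
The evaluation loop can be sketched as follows: insert names from demographically defined name sets into clinical templates and measure how often a de-identification system removes them. Here, deidentify is a hypothetical stand-in for any of the nine evaluated systems, and the templates and name sets are illustrative.

```python
# Minimal sketch of measuring name-detection recall per demographic name set.
# Assumptions: `deidentify`, the templates, and the name sets are illustrative placeholders.
from typing import Callable, Dict, List

def name_recall(deidentify: Callable[[str], str],
                templates: List[str],
                name_sets: Dict[str, List[str]]) -> Dict[str, float]:
    recall = {}
    for group, names in name_sets.items():
        hits, total = 0, 0
        for template in templates:
            for name in names:
                note = template.format(name=name)
                # The name is "caught" if it no longer appears in the output.
                hits += name not in deidentify(note)
                total += 1
        recall[group] = hits / total
    return recall

templates = ["Patient {name} was admitted with chest pain.",
             "{name} denies fever or chills."]
name_sets = {"group_a": ["Emily Walsh"], "group_b": ["Latonya Washington"]}
# Toy system that only redacts one name, to show a recall gap between groups.
print(name_recall(lambda note: note.replace("Emily Walsh", "[NAME]"),
                  templates, name_sets))
```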

* Accepted by FAccT 2023 

Improving Mutual Information Estimation with Annealed and Energy-Based Bounds

Mar 13, 2023
Rob Brekelmans, Sicong Huang, Marzyeh Ghassemi, Greg Ver Steeg, Roger Grosse, Alireza Makhzani

Mutual information (MI) is a fundamental quantity in information theory and machine learning. However, direct estimation of MI is intractable, even if the true joint probability density for the variables of interest is known, as it involves estimating a potentially high-dimensional log partition function. In this work, we present a unifying view of existing MI bounds from the perspective of importance sampling, and propose three novel bounds based on this approach. Since accurate estimation of MI without density information requires a sample size exponential in the true MI, we assume either a single marginal or the full joint density information is known. In settings where the full joint density is available, we propose Multi-Sample Annealed Importance Sampling (AIS) bounds on MI, which we demonstrate can tightly estimate large values of MI in our experiments. In settings where only a single marginal distribution is known, we propose Generalized IWAE (GIWAE) and MINE-AIS bounds. Our GIWAE bound unifies variational and contrastive bounds in a single framework that generalizes InfoNCE, IWAE, and Barber-Agakov bounds. Our MINE-AIS method improves upon existing energy-based methods such as MINE-DV and MINE-F by directly optimizing a tighter lower bound on MI. MINE-AIS uses MCMC sampling to estimate gradients for training and Multi-Sample AIS for evaluating the bound. Our methods are particularly suitable for evaluating MI in deep generative models, since explicit forms of the marginal or joint densities are often available. We evaluate our bounds on estimating the MI of VAEs and GANs trained on the MNIST and CIFAR datasets, and showcase significant gains over existing bounds in these challenging settings with high ground truth MI.
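
For reference, two of the standard bounds that the GIWAE framework unifies can be stated as follows; the paper's generalized multi-sample forms build on these.

```latex
% Barber--Agakov (variational) lower bound, valid for any decoder q(y|x):
I(X;Y) \;\ge\; H(Y) + \mathbb{E}_{p(x,y)}\!\left[\log q(y \mid x)\right].
% InfoNCE (contrastive) lower bound with critic f and K paired samples (capped at log K):
I(X;Y) \;\ge\; \mathbb{E}\!\left[\frac{1}{K}\sum_{i=1}^{K}
  \log \frac{e^{f(x_i, y_i)}}{\tfrac{1}{K}\sum_{j=1}^{K} e^{f(x_i, y_j)}}\right].
```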

* A shorter version appeared in the International Conference on Learning Representations (ICLR) 2022 

Change is Hard: A Closer Look at Subpopulation Shift

Feb 23, 2023
Yuzhe Yang, Haoran Zhang, Dina Katabi, Marzyeh Ghassemi

Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood about the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
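
A minimal sketch of the worst-class accuracy selection criterion mentioned above: among candidate checkpoints, pick the one whose lowest per-class validation accuracy is highest, using only labels and predictions (no group annotations). The checkpoint structure and toy predictions are illustrative.

```python
# Minimal sketch of model selection by worst-class validation accuracy.
# Assumptions: the candidate checkpoints and predictions are toy placeholders.
import numpy as np

def worst_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    accs = []
    for c in np.unique(y_true):
        mask = y_true == c
        accs.append((y_pred[mask] == c).mean())
    return min(accs)

y_true = np.array([0, 0, 0, 1, 1, 1])
checkpoints = {
    "ckpt_a": np.array([0, 0, 0, 1, 0, 0]),  # perfect on class 0, poor on class 1
    "ckpt_b": np.array([0, 0, 1, 1, 1, 0]),  # more balanced across classes
}
best = max(checkpoints, key=lambda k: worst_class_accuracy(y_true, checkpoints[k]))
print(best)  # ckpt_b
```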

* Code and data are available at https://github.com/YyzHarry/SubpopBench 

Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning

Jan 13, 2023
Taylor W. Killian, Sonali Parbhoo, Marzyeh Ghassemi

In safety-critical decision-making scenarios, being able to identify worst-case outcomes, or dead-ends, is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.
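
To illustrate the risk-sensitive flavor of this approach, the sketch below flags a state when even the best action's lower-tail return (a CVaR-style statistic over an estimated return distribution) falls below a threshold. The tail level, threshold, and sampled returns are assumptions, not DistDeD's exact parameterization.

```python
# Minimal sketch of risk-sensitive dead-end flagging from return-distribution samples.
# Assumptions: alpha, the threshold, and the Gaussian return samples are illustrative.
import numpy as np

def cvar(samples: np.ndarray, alpha: float = 0.1) -> float:
    """Mean of the worst alpha-fraction of return samples (lower tail)."""
    cutoff = np.quantile(samples, alpha)
    return samples[samples <= cutoff].mean()

def is_dead_end(return_samples_per_action, alpha=0.1, threshold=-0.8):
    # A state is flagged when no available action has an acceptable worst-case value.
    return max(cvar(s, alpha) for s in return_samples_per_action) < threshold

rng = np.random.default_rng(0)
risky = [rng.normal(-0.9, 0.05, 1000), rng.normal(-0.95, 0.05, 1000)]
safe = [rng.normal(0.2, 0.3, 1000), rng.normal(-0.1, 0.2, 1000)]
print(is_dead_end(risky), is_dead_end(safe))  # True False
```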

Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors

Nov 20, 2022
Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi

Large pre-trained models decay over long-term deployment as input distributions shift, user requirements change, or crucial knowledge gaps are discovered. Recently, model editors have been proposed to modify a model's behavior by adjusting its weights during deployment. However, when editing the same model multiple times, these approaches quickly decay a model's performance on upstream data and forget how to fix previous errors. We propose and study a novel Lifelong Model Editing setting, where streaming errors are identified for a deployed model and we update the model to correct its predictions without influencing unrelated inputs, and without access to training edits, exogenous datasets, or any upstream data for the edited model. To approach this problem, we introduce General Retrieval Adaptors for Continual Editing, or GRACE, which learns to cache a chosen layer's activations in an adaptive codebook as edits stream in, leaving the original model weights frozen. GRACE can thus edit models thousands of times in a row using only streaming errors, while minimally influencing unrelated inputs. Experimentally, we show that GRACE improves over recent model editors and generalizes to unseen inputs. Our code is available at https://www.github.com/thartvigsen/grace.
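
A minimal sketch of the key-value adaptor idea: edits are cached as (activation key, corrected value) pairs at one layer, and a query within a deferral radius of a cached key is answered from the codebook while everything else passes through unchanged. This simplification omits GRACE's key-splitting and training details.

```python
# Minimal sketch of a discrete key-value adaptor wrapped around one layer.
# Assumptions: the radius and 2-D activations are toy values for illustration.
import torch

class KeyValueAdaptor:
    def __init__(self, radius: float = 1.0):
        self.keys, self.values, self.radius = [], [], radius

    def add_edit(self, key_activation: torch.Tensor, corrected_value: torch.Tensor):
        # Cache one streaming edit; original model weights stay frozen.
        self.keys.append(key_activation)
        self.values.append(corrected_value)

    def __call__(self, activation: torch.Tensor) -> torch.Tensor:
        if self.keys:
            dists = torch.stack([torch.norm(activation - k) for k in self.keys])
            i = int(dists.argmin())
            if dists[i] < self.radius:   # inside a deferral ball -> apply the edit
                return self.values[i]
        return activation                # unrelated input -> leave untouched

adaptor = KeyValueAdaptor(radius=0.5)
adaptor.add_edit(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))
print(adaptor(torch.tensor([1.05, 0.0])))  # retrieved edited value
print(adaptor(torch.tensor([3.0, 3.0])))   # passes through unchanged
```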

"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts

Oct 19, 2022
Haoran Zhang, Harvineet Singh, Marzyeh Ghassemi, Shalmali Joshi

Figure 1 for "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
Figure 2 for "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
Figure 3 for "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
Figure 4 for "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts

Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These manifest as changes to the underlying data generating mechanisms, and thereby result in distribution shifts across environments. Attributing performance changes to specific shifts, such as covariate or concept shifts, is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition (or a set) of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
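
The attribution scheme can be sketched with exact Shapley values over a small set of "players", where each player is one candidate distribution (e.g., the covariate distribution or the label-given-feature distribution) and the value function scores a coalition of shifted mechanisms. The toy additive value function below is illustrative; the paper estimates coalition values via importance weighting.

```python
# Minimal sketch of exact Shapley attribution over candidate distribution shifts.
# Assumptions: the toy value function and the two named mechanisms are placeholders.
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                # Standard Shapley weight for a coalition of size r.
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += w * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Toy value function: performance drop attributable to each shifted mechanism.
drops = {"covariate_shift": 0.06, "concept_shift": 0.02}
value = lambda coalition: sum(drops[m] for m in coalition)
print(shapley_values(list(drops), value))  # {'covariate_shift': 0.06, 'concept_shift': 0.02}
```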
