Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Desmond C. Ong

Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers

Apr 25, 2025

Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, Nick Haber

Abstract:Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to *replace* mental health providers, a use case promoted in the tech startup and research space. We conduct a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance between therapist and client. We then assess the ability of LLMs to reproduce and adhere to these aspects of therapeutic relationships by conducting several experiments investigating the responses of current LLMs, such as `gpt-4o`. Contrary to best practices in the medical community, LLMs 1) express stigma toward those with mental health conditions and 2) respond inappropriately to certain common (and critical) conditions in naturalistic therapy settings -- e.g., LLMs encourage clients' delusional thinking, likely due to their sycophancy. This occurs even with larger and newer LLMs, indicating that current safety practices may not address these gaps. Furthermore, we note foundational and practical barriers to the adoption of LLMs as therapists, such as that a therapeutic alliance requires human characteristics (e.g., identity and stakes). For these reasons, we conclude that LLMs should not replace therapists, and we discuss alternative roles for LLMs in clinical therapy.

Via

Access Paper or Ask Questions

Modeling Subjectivity in Cognitive Appraisal with Language Models

Mar 14, 2025

Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, Petr Slovak, Yulan He

Abstract:As the utilization of language models in interdisciplinary, human-centered studies grow, the expectation of model capabilities continues to evolve. Beyond excelling at conventional tasks, models are recently expected to perform well on user-centric measurements involving confidence and human (dis)agreement -- factors that reflect subjective preferences. While modeling of subjectivity plays an essential role in cognitive science and has been extensively studied, it remains under-explored within the NLP community. In light of this gap, we explore how language models can harness subjectivity by conducting comprehensive experiments and analysis across various scenarios using both fine-tuned models and prompt-based large language models (LLMs). Our quantitative and qualitative experimental results indicate that existing post-hoc calibration approaches often fail to produce satisfactory results. However, our findings reveal that personality traits and demographical information are critical for measuring subjectivity. Furthermore, our in-depth analysis offers valuable insights for future research and development in the interdisciplinary studies of NLP and cognitive science.

Via

Access Paper or Ask Questions

Human-like Affective Cognition in Foundation Models

Sep 19, 2024

Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman

Abstract:Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are ``superhuman'' -- they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.

Via

Access Paper or Ask Questions

GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?

Jun 13, 2024

Desmond C. Ong

Figure 1 for GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?

Abstract:Large Language Models have taken the cognitive science world by storm. It is perhaps timely now to take stock of the various research paradigms that have been used to make scientific inferences about ``cognition" in these models or about human cognition. We review several emerging research paradigms -- GPT-ology, LLMs-as-computational-models, and ``silicon sampling" -- and review recent papers that have used LLMs under these paradigms. In doing so, we discuss their claims as well as challenges to scientific inference under these various paradigms. We highlight several outstanding issues about LLMs that have to be addressed to push our science forward: closed-source vs open-sourced models; (the lack of visibility of) training data; and reproducibility in LLM research, including forming conventions on new task ``hyperparameters" like instructions and prompts.

* CogSci 2024; 6 pages + 2 page of references

Via

Access Paper or Ask Questions

Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Apr 01, 2024

Hongli Zhan, Allen Zheng, Yoon Kyung Lee, Jina Suh, Junyi Jessy Li, Desmond C. Ong

Figure 1 for Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Figure 2 for Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Figure 3 for Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Figure 4 for Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided

Abstract:Large language models (LLMs) have offered new opportunities for emotional support, and recent work has shown that they can produce empathic responses to people in distress. However, long-term mental well-being requires emotional self-regulation, where a one-time empathic response falls short. This work takes a first step by engaging with cognitive reappraisals, a strategy from psychology practitioners that uses language to targetedly change negative appraisals that an individual makes of the situation; such appraisals is known to sit at the root of human emotional experience. We hypothesize that psychologically grounded principles could enable such advanced psychology capabilities in LLMs, and design RESORT which consists of a series of reappraisal constitutions across multiple dimensions that can be used as LLM instructions. We conduct a first-of-its-kind expert evaluation (by clinical psychologists with M.S. or Ph.D. degrees) of an LLM's zero-shot ability to generate cognitive reappraisal responses to medium-length social media messages asking for support. This fine-grained evaluation showed that even LLMs at the 7B scale guided by RESORT are capable of generating empathic responses that can help users reappraise their situations.

Via

Access Paper or Ask Questions

Large Language Models Produce Responses Perceived to be Empathic

Mar 26, 2024

Yoon Kyung Lee, Jina Suh, Hongli Zhan, Junyi Jessy Li, Desmond C. Ong

Abstract:Large Language Models (LLMs) have demonstrated surprising performance on many tasks, including writing supportive messages that display empathy. Here, we had these models generate empathic messages in response to posts describing common life experiences, such as workplace situations, parenting, relationships, and other anxiety- and anger-eliciting situations. Across two studies (N=192, 202), we showed human raters a variety of responses written by several models (GPT4 Turbo, Llama2, and Mistral), and had people rate these responses on how empathic they seemed to be. We found that LLM-generated responses were consistently rated as more empathic than human-written responses. Linguistic analyses also show that these models write in distinct, predictable ``styles", in terms of their use of punctuation, emojis, and certain words. These results highlight the potential of using LLMs to enhance human peer support in contexts where empathy is important.

Via

Access Paper or Ask Questions

Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models

Oct 22, 2023

Hongli Zhan, Desmond C. Ong, Junyi Jessy Li

Abstract:The emotions we experience involve complex processes; besides physiological aspects, research in psychology has studied cognitive appraisals where people assess their situations subjectively, according to their own values (Scherer, 2005). Thus, the same situation can often result in different emotional experiences. While the detection of emotion is a well-established task, there is very limited work so far on the automatic prediction of cognitive appraisals. This work fills the gap by presenting CovidET-Appraisals, the most comprehensive dataset to-date that assesses 24 appraisal dimensions, each with a natural language rationale, across 241 Reddit posts. CovidET-Appraisals presents an ideal testbed to evaluate the ability of large language models -- excelling at a wide range of NLP tasks -- to automatically assess and explain cognitive appraisals. We found that while the best models are performant, open-sourced LLMs fall short at this task, presenting a new challenge in the future development of emotionally intelligent models. We release our dataset at https://github.com/honglizhan/CovidET-Appraisals-Public.

* EMNLP 2023 (Findings) Camera-Ready Version

Via

Access Paper or Ask Questions

Using Positive Matching Contrastive Loss with Facial Action Units to mitigate bias in Facial Expression Recognition

Mar 08, 2023

Varsha Suresh, Desmond C. Ong

Abstract:Machine learning models automatically learn discriminative features from the data, and are therefore susceptible to learn strongly-correlated biases, such as using protected attributes like gender and race. Most existing bias mitigation approaches aim to explicitly reduce the model's focus on these protected features. In this work, we propose to mitigate bias by explicitly guiding the model's focus towards task-relevant features using domain knowledge, and we hypothesize that this can indirectly reduce the dependence of the model on spurious correlations it learns from the data. We explore bias mitigation in facial expression recognition systems using facial Action Units (AUs) as the task-relevant feature. To this end, we introduce Feature-based Positive Matching Contrastive Loss which learns the distances between the positives of a sample based on the similarity between their corresponding AU embeddings. We compare our approach with representative baselines and show that incorporating task-relevant features via our method can improve model fairness at minimal cost to classification performance.

* 10th International Conference on Affective Computing and Intelligent Interaction (ACII), 2022

Via

Access Paper or Ask Questions

ReaSCAN: Compositional Reasoning in Language Grounding

Sep 18, 2021

Zhengxuan Wu, Elisa Kreiss, Desmond C. Ong, Christopher Potts

Figure 1 for ReaSCAN: Compositional Reasoning in Language Grounding

Figure 2 for ReaSCAN: Compositional Reasoning in Language Grounding

Figure 3 for ReaSCAN: Compositional Reasoning in Language Grounding

Figure 4 for ReaSCAN: Compositional Reasoning in Language Grounding

Abstract:The ability to compositionally map language to referents, relations, and actions is an essential component of language understanding. The recent gSCAN dataset (Ruis et al. 2020, NeurIPS) is an inspiring attempt to assess the capacity of models to learn this kind of grounding in scenarios involving navigational instructions. However, we show that gSCAN's highly constrained design means that it does not require compositional interpretation and that many details of its instructions and scenarios are not required for task success. To address these limitations, we propose ReaSCAN, a benchmark dataset that builds off gSCAN but requires compositional language interpretation and reasoning about entities and relations. We assess two models on ReaSCAN: a multi-modal baseline and a state-of-the-art graph convolutional neural model. These experiments show that ReaSCAN is substantially harder than gSCAN for both neural architectures. This suggests that ReaSCAN can serve as a valuable benchmark for advancing our understanding of models' compositional generalization and reasoning capabilities.

* NeurIPS 2021 (Dataset and Benchmarks Track)
* 26 pages, 8 figures

Via

Access Paper or Ask Questions

Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification

Sep 12, 2021

Varsha Suresh, Desmond C. Ong

Figure 1 for Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification

Figure 2 for Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification

Figure 3 for Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification

Figure 4 for Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification

Abstract:Fine-grained classification involves dealing with datasets with larger number of classes with subtle differences between them. Guiding the model to focus on differentiating dimensions between these commonly confusable classes is key to improving performance on fine-grained tasks. In this work, we analyse the contrastive fine-tuning of pre-trained language models on two fine-grained text classification tasks, emotion classification and sentiment analysis. We adaptively embed class relationships into a contrastive objective function to help differently weigh the positives and negatives, and in particular, weighting closely confusable negatives more than less similar negative examples. We find that Label-aware Contrastive Loss outperforms previous contrastive methods, in the presence of larger number and/or more confusable classes, and helps models to produce output distributions that are more differentiated.

* Accepted at EMNLP 2021

Via

Access Paper or Ask Questions