Lyle Ungar

University of Pennsylvania

DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity

Aug 30, 2024

Xiaoyu Lin, Xinkai Yu, Ankit Aich, Salvatore Giorgi, Lyle Ungar

Abstract:Large Language Models (LLMs), which simulate human users, are frequently employed to evaluate chatbots in applications such as tutoring and customer service. Effective evaluation necessitates a high degree of human-like diversity within these simulations. In this paper, we demonstrate that conversations generated by GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features. These features include topic variation, lexical attributes, and both the average behavior and diversity (variance) of the language used. To address these discrepancies, we propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions, such as age, gender, emotional tone, and the topics discussed. We assess our approach using differential language analysis combined with deep linguistic inquiry. Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements. Specifically, it enhances the human-likeness of LLM chatbot conversations, increasing their linguistic diversity. On average, we observe a 54 percent reduction in the error of average features between human and LLM-generated conversations. This method of constructing chatbot sets with human-like diversity holds great potential for enhancing the evaluation process of user-facing bots.

Vernacular? I Barely Know Her: Challenges with Style Control and Stereotyping

Jun 18, 2024

Ankit Aich, Tingting Liu, Salvatore Giorgi, Kelsey Isman, Lyle Ungar, Brenda Curtis

Abstract:Large Language Models (LLMs) are increasingly being used in educational and learning applications. Research has demonstrated that controlling for style, to fit the needs of the learner, fosters increased understanding, promotes inclusion, and helps with knowledge distillation. To understand the capabilities and limitations of contemporary LLMs in style control, we evaluated five state-of-the-art models: GPT-3.5, GPT-4, GPT-4o, Llama-3, and Mistral-instruct-7B across two style control tasks. We observed significant inconsistencies in the first task, with model performances averaging between 5th and 8th grade reading levels for tasks intended for first-graders, and standard deviations up to 27.6. For our second task, we observed a statistically significant improvement in performance from 0.02 to 0.26. However, we find that even without stereotypes in reference texts, LLMs often generated culturally insensitive content during their tasks. We provide a thorough analysis and discussion of the results.

Building Knowledge-Guided Lexica to Model Cultural Variation

Jun 17, 2024

Shreya Havaldar, Salvatore Giorgi, Sunny Rai, Thomas Talhelm, Sharath Chandra Guntuku, Lyle Ungar

Abstract:Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs' failure to measure cultural variation or generate culturally varied language.

* Accepted at NAACL 2024

Empirical influence functions to understand the logic of fine-tuning

Jun 01, 2024

Jordan K. Matelsky, Lyle Ungar, Konrad P. Kording

Abstract:Understanding the process of learning in neural networks is crucial for improving their performance and interpreting their behavior. This can be approximately understood by asking how a model's output is influenced when we fine-tune on a new training sample. There are desiderata for such influences, such as decreasing influence with semantic distance, sparseness, noise invariance, transitive causality, and logical consistency. Here we use the empirical influence measured using fine-tuning to demonstrate how individual training samples affect outputs. We show that these desiderata are violated both for simple convolutional networks and for a modern LLM. We also illustrate how prompting can partially rescue this failure. Our paper presents an efficient and practical way of quantifying how well neural networks learn from fine-tuning stimuli. Our results suggest that popular models cannot generalize or perform logic in the way they appear to.

Language-based Valence and Arousal Expressions between the United States and China: a Cross-Cultural Examination

Jan 11, 2024

Young-Min Cho, Dandan Pang, Stuti Thapa, Garrick Sherman, Lyle Ungar, Louis Tay, Sharath Chandra Guntuku

Abstract:Although affective expressions of individuals have been extensively studied using social media, research has primarily focused on the Western context. There are substantial differences among cultures that contribute to their affective expressions. This paper examines the differences between Twitter (X) in the United States and Sina Weibo posts in China on two primary dimensions of affect - valence and arousal. We study the difference in the functional relationship between arousal and valence (so-called V-shaped) among individuals in the US and China and explore the associated content differences. Furthermore, we correlate word usage and topics in both platforms to interpret their differences. We observe that for Twitter users, the variation in emotional intensity is less distinct between negative and positive emotions compared to Weibo users, and there is a sharper escalation in arousal corresponding with heightened emotions. From language features, we discover that affective expressions are associated with personal life and feelings on Twitter, while on Weibo such discussions are about socio-political topics in the society. These results suggest a West-East difference in the V-shaped relationship between valence and arousal of affective expressions on social media influenced by content differences. Our findings have implications for applications and theories related to cultural differences in affective expressions.

Personalized Assignment to One of Many Treatment Arms via Regularized and Clustered Joint Assignment Forests

Nov 01, 2023

Rahul Ladhania, Jann Spiess, Lyle Ungar, Wenbo Wu

Abstract:We consider learning personalized assignments to one of many treatment arms from a randomized controlled trial. Standard methods that estimate heterogeneous treatment effects separately for each arm may perform poorly in this case due to excess variance. We instead propose methods that pool information across treatment arms: First, we consider a regularized forest-based assignment algorithm based on greedy recursive partitioning that shrinks effect estimates across arms. Second, we augment our algorithm by a clustering scheme that combines treatment arms with consistently similar outcomes. In a simulation study, we compare the performance of these approaches to predicting arm-wise outcomes separately, and document gains of directly optimizing the treatment assignment with regularization and clustering. In a theoretical model, we illustrate how a high number of treatment arms makes finding the best arm hard, while we can achieve sizable utility gains from personalization by regularized optimization.

An Integrative Survey on Mental Health Conversational Agents to Bridge Computer Science and Medical Perspectives

Oct 25, 2023

Young Min Cho, Sunny Rai, Lyle Ungar, João Sedoc, Sharath Chandra Guntuku

Abstract:Mental health conversational agents (a.k.a. chatbots) are widely studied for their potential to offer accessible support to those experiencing mental health challenges. Previous surveys on the topic primarily consider papers published in either computer science or medicine, leading to a divide in understanding and hindering the sharing of beneficial knowledge between both domains. To bridge this gap, we conduct a comprehensive literature review using the PRISMA framework, reviewing 534 papers published in both computer science and medicine. Our systematic review reveals 136 key papers on building mental health-related conversational agents with diverse characteristics of modeling and experimental design techniques. We find that computer science papers focus on LLM techniques and evaluating response quality using automated metrics with little attention to the application while medical papers use rule-based conversational agents and outcome metrics to measure the health outcomes of participants. Based on our findings on transparency, ethics, and cultural heterogeneity in this review, we provide a few recommendations to help bridge the disciplinary divide and enable the cross-disciplinary development of mental health conversational agents.

* Accepted in EMNLP 2023 Main Conference, camera ready

Comparing Styles across Languages

Oct 11, 2023

Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar

Abstract:Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.

* To appear in EMNLP 2023

Multilingual Language Models are not Multicultural: A Case Study in Emotion

Jul 09, 2023

Shreya Havaldar, Sunny Rai, Bhumika Singhal, Langchen Liu, Sharath Chandra Guntuku, Lyle Ungar

Abstract:Emotions are experienced and expressed differently across the world. In order to use Large Language Models (LMs) for multilingual tasks that require emotional sensitivity, LMs must reflect this cultural variation in emotion. In this study, we investigate whether the widely-used multilingual LMs in 2023 reflect differences in emotional expressions across cultures and languages. We find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs do not successfully learn the culturally appropriate nuances of emotion and we highlight possible research directions towards correcting this.

* Accepted to WASSA at ACL 2023

TopEx: Topic-based Explanations for Model Comparison

Jun 02, 2023

Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar

Abstract:Meaningfully comparing language models is challenging with current explanation methods. Current explanations are overwhelming for humans due to large vocabularies or incomparable across models. We present TopEx, an explanation method that enables a level playing field for comparing language models via model-agnostic topics. We demonstrate how TopEx can identify similarities and differences between DistilRoBERTa and GPT-2 on a variety of NLP tasks.

* Accepted to ICLR 2023, Tiny Papers Track
