Jinho D. Choi

Tinker Tales: Interactive Storytelling Framework for Early Childhood Narrative Development and AI Literacy

Apr 17, 2025

Nayoung Choi, Peace Cyebukayire, Jinho D. Choi

Abstract: This paper presents Tinker Tales, an interactive storytelling framework in the format of a board game, designed to support both narrative development and AI literacy in early childhood. The framework integrates tangible and speech-based interactions with AI through NFC chip-attached pawns and tokens, along with a speaker and microphone. Children select and define key story elements (such as characters, places, items, and emotions) using the pawns and tokens, providing further details to the AI and receiving proper assistance, similar to how adults prompt AI for specific tasks (e.g., writing). For evaluation, several game sessions were simulated with a child AI agent, and the quality and safety of the generated stories were assessed from various perspectives. This work highlights the potential of combining physical and digital elements in AI literacy, offering a safe and engaging way for children to learn how to effectively collaborate with AI.

Trustworthy Answers, Messier Data: Bridging the Gap in Low-Resource Retrieval-Augmented Generation for Domain Expert Systems

Feb 26, 2025

Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi

Abstract: Retrieval-augmented generation (RAG) has become a key technique for enhancing LLMs by reducing hallucinations, especially in domain expert systems where LLMs may lack sufficient inherent knowledge. However, developing these systems in low-resource settings introduces several challenges: (1) handling heterogeneous data sources, (2) optimizing the retrieval phase for trustworthy answers, and (3) evaluating generated answers across diverse aspects. To address these, we introduce a data generation pipeline that transforms raw multi-modal data into a structured corpus and Q&A pairs, an advanced re-ranking phase that improves retrieval precision, and a reference matching algorithm that enhances answer traceability. Applied to the automotive engineering domain, our system improves factual correctness (+1.94), informativeness (+1.16), and helpfulness (+1.67) over a non-RAG baseline, based on a 1-5 scale by an LLM judge. These results highlight the effectiveness of our approach across distinct aspects, with strong answer grounding and transparency.

Finding A Voice: Evaluating African American Dialect Generation for Chatbot Technology

Jan 07, 2025

Sarah E. Finch, Ellie S. Paek, Sejung Kwon, Ikseon Choi, Jessica Wells, Rasheeta Chandler, Jinho D. Choi

Abstract: As chatbots become increasingly integrated into everyday tasks, designing systems that accommodate diverse user populations is crucial for fostering trust, engagement, and inclusivity. This study investigates the ability of contemporary Large Language Models (LLMs) to generate African American Vernacular English (AAVE) and evaluates the impact of AAVE usage on user experiences in chatbot applications. We analyze the performance of three LLM families (Llama, GPT, and Claude) in producing AAVE-like utterances at varying dialect intensities and assess user preferences across multiple domains, including healthcare and education. Despite LLMs' proficiency in generating AAVE-like language, findings indicate that AAVE-speaking users prefer Standard American English (SAE) chatbots, with higher levels of AAVE correlating with lower ratings for a variety of characteristics, including chatbot trustworthiness and role appropriateness. These results highlight the complexities of creating inclusive AI systems and underscore the need for further exploration of diversity to enhance human-computer interactions.

Transforming Slot Schema Induction with Generative Dialogue State Inference

Aug 03, 2024

James D. Finch, Boxin Zhao, Jinho D. Choi

Abstract: The challenge of defining a slot schema to represent the state of a task-oriented dialogue system is addressed by Slot Schema Induction (SSI), which aims to automatically induce slots from unlabeled dialogue data. Whereas previous approaches induce slots by clustering value spans extracted directly from the dialogue text, we demonstrate the power of discovering slots using a generative approach. By training a model to generate slot names and values that summarize key dialogue information with no prior task knowledge, our SSI method discovers high-quality candidate information for representing dialogue state. These discovered slot-value candidates can be easily clustered into unified slot schemas that align well with human-authored schemas. Experimental comparisons on the MultiWOZ and SGD datasets demonstrate that Generative Dialogue State Inference (GenDSI) outperforms the previous state-of-the-art on multiple aspects of the SSI task.

* Accepted to SIGDIAL 2024

ESM+: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models

Jul 10, 2024

Benjamin Ascoli, Ram Kandikonda, Jinho D. Choi

Abstract: The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. Despite several challenges, recent models have made remarkable advancements in this task using large language models (LLMs). Interestingly, we find that LLM-based models without fine-tuning behave differently from their fine-tuned counterparts, so current evaluation metrics fail to convey their performance accurately. Thus, we analyze the two primary metrics, Test Suite Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM), to examine their robustness for this task and address shortcomings. We compare the performance of 9 LLM-based models using EXE, the original ESM, and our improved ESM (called ESM+). Our results show that EXE and ESM have high false positive and negative rates of 11.3% and 13.9%, while ESM+ gives those of 0.1% and 2.6% respectively, providing a significantly more stable evaluation. We release the ESM+ script as open source so that the community can contribute to it while enjoying a more reliable assessment of Text-to-SQL.
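
To make the notion of a false positive in execution-based evaluation concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the ESM+ script or the authors' test suite; the table, the queries, and the helper names `build_db` and `execution_match` are hypothetical and chosen only for illustration. The idea shown is that two semantically different queries can return identical rows on one database, and that a database containing a boundary value exposes the difference.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database with a small, hypothetical singer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", rows)
    return conn


def execution_match(conn, gold_sql, pred_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    gold_rows = sorted(conn.execute(gold_sql).fetchall())
    pred_rows = sorted(conn.execute(pred_sql).fetchall())
    return gold_rows == pred_rows


gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE age >= 30"  # not equivalent to gold

# No singer is exactly 30 here, so the two queries happen to agree.
small_db = build_db([("Ann", 25), ("Bo", 41)])
# Adding a singer aged exactly 30 separates the two queries.
larger_db = build_db([("Ann", 25), ("Bo", 41), ("Cy", 30)])

false_positive_on_small = execution_match(small_db, gold, pred)
caught_on_larger = execution_match(larger_db, gold, pred)

print(false_positive_on_small, caught_on_larger)
```

On the first database the wrong prediction is accepted; on the second it is rejected, which is why execution-based metrics depend on how well the evaluation databases cover edge cases.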

Leveraging Explicit Reasoning for Inference Integration in Commonsense-Augmented Dialogue Models

Jun 13, 2024

Sarah E. Finch, Jinho D. Choi

Abstract: Open-domain dialogue systems need to grasp social commonsense to understand and respond effectively to human users. Commonsense-augmented dialogue models have been proposed that aim to infer commonsense knowledge from dialogue contexts in order to improve response quality. However, existing approaches to commonsense-augmented dialogue rely on implicit reasoning to integrate commonsense inferences during response generation. In this study, we explore the impact of explicit versus implicit reasoning over commonsense for dialogue response generation. Our findings demonstrate that separating commonsense reasoning into explicit steps for generating, selecting, and integrating commonsense into responses leads to better dialogue interactions, improving naturalness, engagement, specificity, and overall quality. Subsequent analyses of these findings reveal how effective various types of commonsense are in generating responses and which response traits are enhanced through explicit reasoning for commonsense integration. Our work advances research in open-domain dialogue by achieving a new state-of-the-art in commonsense-augmented response generation.

Leveraging Diverse Data Generation for Adaptable Zero-Shot Dialogue State Tracking

May 21, 2024

James D. Finch, Boxin Zhao, Jinho D. Choi

Abstract: This work demonstrates that substantial gains in zero-shot dialogue state tracking (DST) accuracy can be achieved by increasing the diversity of training data using synthetic data generation techniques. Current DST training resources are severely limited in the number of application domains and slot types they cover due to the high costs of data collection, resulting in limited adaptability to new domains. The presented work overcomes this challenge using a novel, fully automatic data generation approach to create synthetic zero-shot DST training resources. Unlike previous approaches for generating DST data, the presented approach generates entirely new application domains to generate dialogues, complete with silver dialogue state annotations and slot descriptions. This approach is used to create the D0T dataset for training zero-shot DST models, which covers an unprecedented 1,000+ domains. Experiments performed on the MultiWOZ benchmark indicate that training models on diverse synthetic data yields a performance improvement of +6.7% Joint Goal Accuracy, achieving results competitive with much larger models.
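
For readers unfamiliar with the metric, Joint Goal Accuracy is commonly defined as the fraction of dialogue turns whose predicted state matches the gold state on every slot. The sketch below assumes that common definition and is not taken from the paper's evaluation code; the function name `joint_goal_accuracy`, the slot names, and the values are hypothetical.

```python
def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose predicted slot-value dict equals the gold dict."""
    if len(gold_states) != len(pred_states):
        raise ValueError("gold and predicted states must cover the same turns")
    if not gold_states:
        return 0.0
    # A turn counts as correct only if every slot and value matches exactly.
    correct = sum(1 for g, p in zip(gold_states, pred_states) if g == p)
    return correct / len(gold_states)


# Hypothetical three-turn dialogue; the state accumulates slots turn by turn.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot value
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]

jga = joint_goal_accuracy(gold, pred)
print(f"{jga:.3f}")
```

Because a single wrong slot makes the whole turn count as incorrect, the metric is strict, and the second turn above lowers the score to two correct turns out of three.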

Automating PTSD Diagnostics in Clinical Interviews: Leveraging Large Language Models for Trauma Assessments

May 18, 2024

Sichang Tu, Abigail Powers, Natalie Merrill, Negar Fani, Sierra Carter, Stephen Doogan, Jinho D. Choi

Abstract: The shortage of clinical workforce presents significant challenges in mental healthcare, limiting access to formal diagnostics and services. We aim to tackle this shortage by integrating a customized large language model (LLM) into the workflow, thus promoting equity in mental healthcare for the general population. Although LLMs have showcased their capability in clinical decision-making, their adaptation to severe conditions like Post-traumatic Stress Disorder (PTSD) remains largely unexplored. Therefore, we collect 411 clinician-administered diagnostic interviews and devise a novel approach to obtain high-quality data. Moreover, we build a comprehensive framework to automate PTSD diagnostic assessments based on interview contents by leveraging two state-of-the-art LLMs, GPT-4 and Llama-2, with potential for broader clinical diagnoses. Our results illustrate strong promise for LLMs, tested on our dataset, to aid clinicians in diagnostic validation. To the best of our knowledge, this is the first AI system that fully automates assessments for mental illness based on clinician-administered interviews.

What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models

Apr 09, 2024

Jeongrok Yu, Seong Ug Kim, Jacob Choi, Jinho D. Choi

Abstract: Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.

Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model

Feb 20, 2024

Liyan Xu, Zhenlin Su, Mo Yu, Jin Xu, Jinho D. Choi, Jie Zhou, Fei Liu

Abstract: Factual inconsistency poses a significant hurdle for the commercial deployment of abstractive summarizers. In the era of Large Language Models (LLMs), this work focuses on two important questions: what is the best way to leverage an LLM for factual inconsistency detection, and how can we distill a smaller LLM with both high efficiency and efficacy? Three zero-shot paradigms are first proposed and evaluated across five diverse datasets: direct inference on the entire summary, direct inference on each summary window, and entity verification through question generation and answering. Experiments suggest that an LLM is itself capable of resolving this task without training under the proper paradigm design, surpassing strong trained baselines by 2.8% on average. To further promote practical utility, we then propose training strategies aimed at distilling a smaller open-source LLM that learns to score the entire summary at once with high accuracy. The distilled model outperforms the zero-shot approaches that use a much larger LLM, serving as an effective and efficient ready-to-use scorer.
