Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dorottya Demszky

TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Oct 06, 2025

Janos Perczel, Jin Chow, Dorottya Demszky

Abstract:The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap we introduce TeachLM - an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance - doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and greater personalization of instruction.

* 28 pages, 9 figures

Via

Access Paper or Ask Questions

Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes

May 29, 2025

Li Lucy, Camilla Griffiths, Sarah Levine, Jennifer L. Eberhardt, Dorottya Demszky, David Bamman

Abstract:Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.

* 26 pages, 7 figures, Findings of ACL 2025

Via

Access Paper or Ask Questions

From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

May 20, 2025

Ahmed Adel Attia, Dorottya Demszky, Jing Liu, Carol Espy-Wilson

Abstract:Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.

Via

Access Paper or Ask Questions

Multi-Stage Speaker Diarization for Noisy Classrooms

May 16, 2025

Ali Sartaz Khan, Tolulope Ogunremi, Ahmed Attia, Dorottya Demszky

Abstract:Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.

Via

Access Paper or Ask Questions

Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

Nov 12, 2024

Rose E. Wang, Pawan Wirawarn, Kenny Lam, Omar Khattab, Dorottya Demszky

Figure 1 for Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

Figure 2 for Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

Figure 3 for Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

Figure 4 for Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

Abstract:Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language models (LLMs) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.

* EMNLP 2024 Findings. Our code and dataset are open-sourced at https://github.com/rosewang2008/posr

Via

Access Paper or Ask Questions

Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

May 15, 2024

Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi, Jing Liu, Carol Espy-Wilson

Figure 1 for Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Figure 2 for Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Figure 3 for Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Figure 4 for Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Abstract:Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones, classroom conditions as well as classroom demographics. Our CPT models show improved ability to generalize to different demographics unseen in the labeled finetuning data.

Via

Access Paper or Ask Questions

Backtracing: Retrieving the Cause of the Query

Mar 06, 2024

Rose E. Wang, Pawan Wirawarn, Omar Khattab, Noah Goodman, Dorottya Demszky

Figure 1 for Backtracing: Retrieving the Cause of the Query

Figure 2 for Backtracing: Retrieving the Cause of the Query

Figure 3 for Backtracing: Retrieving the Cause of the Query

Figure 4 for Backtracing: Retrieving the Cause of the Query

Abstract:Many online content portals allow users to ask questions to supplement their understanding (e.g., of lectures). While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators -- such as lecturers who want to improve their content -- identify segments that _caused_ a user to ask those questions. We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query. We formalize three real-world domains for which backtracing is important in improving content delivery and communication: understanding the cause of (a) student confusion in the Lecture domain, (b) reader curiosity in the News Article domain, and (c) user emotion in the Conversation domain. We evaluate the zero-shot performance of popular information retrieval methods and language modeling methods, including bi-encoder, re-ranking and likelihood-based methods and ChatGPT. While traditional IR systems retrieve semantically relevant information (e.g., details on "projection matrices" for a query "does projecting multiple times still lead to the same point?"), they often miss the causally relevant context (e.g., the lecturer states "projecting twice gets me the same answer as one projection"). Our results show that there is room for improvement on backtracing and it requires new retrieval approaches. We hope our benchmark serves to improve future retrieval systems for backtracing, spawning systems that refine content generation and identify linguistic triggers influencing user queries. Our code and data are open-sourced: https://github.com/rosewang2008/backtracing.

* Code: https://github.com/rosewang2008/backtracing; EACL 2024 Findings, Long Paper

Via

Access Paper or Ask Questions

Edu-ConvoKit: An Open-Source Library for Education Conversation Data

Feb 07, 2024

Rose E. Wang, Dorottya Demszky

Abstract:We introduce Edu-ConvoKit, an open-source library designed to handle pre-processing, annotation and analysis of conversation data in education. Resources for analyzing education conversation data are scarce, making the research challenging to perform and therefore hard to access. We address these challenges with Edu-ConvoKit. Edu-ConvoKit is open-source (https://github.com/stanfordnlp/edu-convokit ), pip-installable (https://pypi.org/project/edu-convokit/ ), with comprehensive documentation (https://edu-convokit.readthedocs.io/en/latest/ ). Our demo video is available at: https://youtu.be/zdcI839vAko?si=h9qlnl76ucSuXb8- . We include additional resources, such as Colab applications of Edu-ConvoKit to three diverse education datasets and a repository of Edu-ConvoKit related papers, that can be found in our GitHub repository.

* https://github.com/stanfordnlp/edu-convokit https://edu-convokit.readthedocs.io/en/latest/

Via

Access Paper or Ask Questions

Measuring Five Accountable Talk Moves to Improve Instruction at Scale

Nov 02, 2023

Ashlee Kupor, Candice Morgan, Dorottya Demszky

Figure 1 for Measuring Five Accountable Talk Moves to Improve Instruction at Scale

Figure 2 for Measuring Five Accountable Talk Moves to Improve Instruction at Scale

Figure 3 for Measuring Five Accountable Talk Moves to Improve Instruction at Scale

Figure 4 for Measuring Five Accountable Talk Moves to Improve Instruction at Scale

Abstract:Providing consistent, individualized feedback to teachers on their instruction can improve student learning outcomes. Such feedback can especially benefit novice instructors who teach on online platforms and have limited access to instructional training. To build scalable measures of instruction, we fine-tune RoBERTa and GPT models to identify five instructional talk moves inspired by accountable talk theory: adding on, connecting, eliciting, probing and revoicing students' ideas. We fine-tune these models on a newly annotated dataset of 2500 instructor utterances derived from transcripts of small group instruction in an online computer science course, Code in Place. Although we find that GPT-3 consistently outperforms RoBERTa in terms of precision, its recall varies significantly. We correlate the instructors' use of each talk move with indicators of student engagement and satisfaction, including students' section attendance, section ratings, and assignment completion rates. We find that using talk moves generally correlates positively with student outcomes, and connecting student ideas has the largest positive impact. These results corroborate previous research on the effectiveness of accountable talk moves and provide exciting avenues for using these models to provide instructors with useful, scalable feedback.

Via

Access Paper or Ask Questions

Step-by-Step Remediation of Students' Mathematical Mistakes

Oct 16, 2023

Rose E. Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, Dorottya Demszky

Figure 1 for Step-by-Step Remediation of Students' Mathematical Mistakes

Figure 2 for Step-by-Step Remediation of Students' Mathematical Mistakes

Figure 3 for Step-by-Step Remediation of Students' Mathematical Mistakes

Figure 4 for Step-by-Step Remediation of Students' Mathematical Mistakes

Abstract:Scaling high-quality tutoring is a major challenge in education. Because of the growing demand, many platforms employ novice tutors who, unlike professional educators, struggle to effectively address student mistakes and thus fail to seize prime learning opportunities for students. In this paper, we explore the potential for large language models (LLMs) to assist math tutors in remediating student mistakes. We present ReMath, a benchmark co-developed with experienced math teachers that deconstructs their thought process for remediation. The benchmark consists of three step-by-step tasks: (1) infer the type of student error, (2) determine the strategy to address the error, and (3) generate a response that incorporates that information. We evaluate the performance of state-of-the-art instruct-tuned and dialog models on ReMath. Our findings suggest that although models consistently improve upon original tutor responses, we cannot rely on models alone to remediate mistakes. Providing models with the error type (e.g., the student is guessing) and strategy (e.g., simplify the problem) leads to a 75% improvement in the response quality over models without that information. Nonetheless, despite the improvement, the quality of the best model's responses still falls short of experienced math teachers. Our work sheds light on the potential and limitations of using current LLMs to provide high-quality learning experiences for both tutors and students at scale. Our work is open-sourced at this link: \url{https://github.com/rosewang2008/remath}.

* Our work is open-sourced at this link: \url{https://github.com/rosewang2008/remath}

Via

Access Paper or Ask Questions