Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.
When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.
While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research and advancements have already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or make a novel one for automated subjective question generation and answer evaluation from text input.
Healthcare systems around the world are grappling with issues like inefficient diagnostics, rising costs, and limited access to specialists. These problems often lead to delays in treatment and poor health outcomes. Most current AI and deep learning diagnostic systems are not very interactive or transparent, making them less effective in real-world, patient-centered environments. This research introduces a diagnostic chatbot powered by a Large Language Model (LLM), using GPT-4o, Retrieval-Augmented Generation, and explainable AI techniques. The chatbot engages patients in a dynamic conversation, helping to extract and normalize symptoms while prioritizing potential diagnoses through similarity matching and adaptive questioning. With Chain-of-Thought prompting, the system also offers more transparent reasoning behind its diagnoses. When tested against traditional machine learning models like Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN, the LLM-based system delivered impressive results, achieving an accuracy of 90% and Top-3 accuracy of 100%. These findings offer a promising outlook for more transparent, interactive, and clinically relevant AI in healthcare.
General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60\% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully able of providing reliable advice in women's health.
This research presents the implementation of a Sharia-compliant chatbot as an interactive medium for consulting Islamic questions, leveraging Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding to ensure contextual and accurate responses. Utilizing the CRISP-DM methodology, the system processes a curated Islam QA dataset of 25,000 question-answer pairs from authentic sources like the Qur'an, Hadith, and scholarly fatwas, formatted in JSON for flexibility and scalability. The chatbot prototype, developed with a Flask API backend and Flutter-based mobile frontend, achieves 87% semantic accuracy in functional testing across diverse topics including fiqh, aqidah, ibadah, and muamalah, demonstrating its potential to enhance religious literacy, digital da'wah, and access to verified Islamic knowledge in the Industry 4.0 era. While effective for closed-domain queries, limitations such as static learning and dataset dependency highlight opportunities for future enhancements like continuous adaptation and multi-turn conversation support, positioning this innovation as a bridge between traditional Islamic scholarship and modern AI-driven consultation.
Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.