Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Murong Yue

PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving

Jun 19, 2026

Murong Yue, Desmond Alexander Mcglone, Emily Slutz, Wenhan Lyu, Yixuan Zhang, Jennifer Suh, Ziyu Yao

Abstract:Collaborative Problem Solving (CPS) is a core skill in education, where the process of peer interaction is highly important. However, existing educational dialogue datasets mostly focus on classroom instruction or tutoring (i.e., teacher/tutor-student interaction), yet datasets centering small-group, student-student interaction are limited. This thus leaves research with limited resources for studying how students interact, coordinate, and solve problems together in real educational settings. To address this, we introduce PeerMathDial, the first dataset of peer CPS dialogues collected from authentic middle school math classrooms. It contains 55 dialogues from 27 students, totaling 6,406 turns. To facilitate research on CPS discourse analysis, we further build a corpus-grounded dialogue act taxonomy assisted by LLMs. Using the dataset and the dialogue act taxonomy, we demonstrate the practical applications of PeerMathDial across three use cases. First, we track how dialogues evolve over time and measure the impact of teacher interventions. Second, we align dialogue actions with student surveys to reveal the connection between students' traits (e.g., confidence, leadership) and their actual behaviors. Third, by evaluating LLMs on dialogue act prediction, we glimpse at the potential of LLMs for student simulation in educational applications. Our dataset and source code will be released to the community.

* 17 pages. Project website (dataset and source code): https://ziyu-yao-nlp-lab.github.io/MathVC-NSF.github.io/. Accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA) co-located at ACL 2026

Via

Access Paper or Ask Questions

ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Oct 09, 2025

Murong Yue, Zhiwei Liu, Liangwei Yang, Jianguo Zhang, Zuxin Liu, Haolin Chen, Ziyu Yao, Silvio Savarese, Caiming Xiong, Shelby Heinecke(+1 more)

Figure 1 for ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Figure 2 for ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Figure 3 for ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Figure 4 for ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Abstract:Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.

Via

Access Paper or Ask Questions

Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Apr 08, 2025

Sai Adith Senthil Kumar, Hao Yan, Saipavan Perepa, Murong Yue, Ziyu Yao

Figure 1 for Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Figure 2 for Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Figure 3 for Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Figure 4 for Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

Abstract:Large Language Models (LLMs) are now increasingly widely used to simulate personas in virtual environments, leveraging their instruction-following capability. However, we discovered that even state-of-the-art LLMs cannot simulate personas with reversed performance (e.g., student personas with low proficiency in educational settings), which impairs the simulation diversity and limits the practical applications of the simulated environments. In this work, using mathematical reasoning as a representative scenario, we propose the first benchmark dataset for evaluating LLMs on simulating personas with reversed performance, a capability that we dub "counterfactual instruction following". We evaluate both open-weight and closed-source LLMs on this task and find that LLMs, including the OpenAI o1 reasoning model, all struggle to follow counterfactual instructions for simulating reversedly performing personas. Intersectionally simulating both the performance level and the race population of a persona worsens the effect even further. These results highlight the challenges of counterfactual instruction following and the need for further research.

Via

Access Paper or Ask Questions

A Survey of Large Language Model Agents for Question Answering

Mar 24, 2025

Murong Yue

Figure 1 for A Survey of Large Language Model Agents for Question Answering

Figure 2 for A Survey of Large Language Model Agents for Question Answering

Abstract:This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

Via

Access Paper or Ask Questions

Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack

Mar 18, 2025

Murong Yue, Ziyu Yao

Abstract:Batch prompting, which combines a batch of multiple queries sharing the same context in one inference, has emerged as a promising solution to reduce inference costs. However, our study reveals a significant security vulnerability in batch prompting: malicious users can inject attack instructions into a batch, leading to unwanted interference across all queries, which can result in the inclusion of harmful content, such as phishing links, or the disruption of logical reasoning. In this paper, we construct BATCHSAFEBENCH, a comprehensive benchmark comprising 150 attack instructions of two types and 8k batch instances, to study the batch prompting vulnerability systematically. Our evaluation of both closed-source and open-weight LLMs demonstrates that all LLMs are susceptible to batch-prompting attacks. We then explore multiple defending approaches. While the prompting-based defense shows limited effectiveness for smaller LLMs, the probing-based approach achieves about 95% accuracy in detecting attacks. Additionally, we perform a mechanistic analysis to understand the attack and identify attention heads that are responsible for it.

Via

Access Paper or Ask Questions

DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Oct 04, 2024

Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu

Figure 1 for DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Figure 2 for DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Figure 3 for DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Figure 4 for DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Abstract:Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called "reasoning actions"), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

Via

Access Paper or Ask Questions

MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

Apr 10, 2024

Murong Yue, Wijdane Mifdal, Yixuan Zhang, Jennifer Suh, Ziyu Yao

Figure 1 for MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

Figure 2 for MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

Figure 3 for MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

Figure 4 for MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

Abstract:Mathematical modeling (MM) is considered a fundamental skill for students in STEM disciplines. Practicing the MM skill is often the most effective when students can engage in group discussion and collaborative problem-solving. However, due to unevenly distributed teachers and educational resources needed to monitor such group activities, students do not always receive equal opportunities for this practice. Excitingly, large language models (LLMs) have recently demonstrated strong capability in both modeling mathematical problems and simulating characters with different traits and properties. Drawing inspiration from the advancement of LLMs, in this work, we present MATHVC, the very first LLM-powered virtual classroom containing multiple LLM-simulated student characters, with whom a human student can practice their MM skill. To encourage each LLM character's behaviors to be aligned with their specified math-relevant properties (termed "characteristics alignment") and the overall conversational procedure to be close to an authentic student MM discussion (termed "conversational procedural alignment"), we proposed three innovations: integrating MM domain knowledge into the simulation, defining a symbolic schema as the ground for character simulation, and designing a meta planner at the platform level to drive the conversational procedure. Through experiments and ablation studies, we confirmed the effectiveness of our simulation approach and showed the promise for MATHVC to benefit real-life students in the future.

* Work in progress

Via

Access Paper or Ask Questions

Can LLM find the green circle? Investigation and Human-guided tool manipulation for compositional generalization

Dec 12, 2023

Min Zhang, Jianfeng He, Shuo Lei, Murong Yue, Linhang Wang, Chang-Tien Lu

Figure 1 for Can LLM find the green circle? Investigation and Human-guided tool manipulation for compositional generalization

Figure 2 for Can LLM find the green circle? Investigation and Human-guided tool manipulation for compositional generalization

Figure 3 for Can LLM find the green circle? Investigation and Human-guided tool manipulation for compositional generalization

Figure 4 for Can LLM find the green circle? Investigation and Human-guided tool manipulation for compositional generalization

Abstract:The meaning of complex phrases in natural language is composed of their individual components. The task of compositional generalization evaluates a model's ability to understand new combinations of components. Previous studies trained smaller, task-specific models, which exhibited poor generalization. While large language models (LLMs) exhibit impressive generalization abilities on many tasks through in-context learning (ICL), their potential for compositional generalization remains unexplored. In this paper, we first empirically investigate prevailing ICL methods in compositional generalization. We find that they struggle with complex compositional questions due to cumulative errors in long reasoning steps and intricate logic required for tool-making. Consequently, we propose a human-guided tool manipulation framework (HTM) that generates tools for sub-questions and integrates multiple tools. Our method enhances the effectiveness of tool creation and usage with minimal human effort. Experiments show that our method achieves state-of-the-art performance on two compositional generalization benchmarks and outperforms existing methods on the most challenging test split by 70%.

* Accepted by ICASSP 2024

Via

Access Paper or Ask Questions

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Oct 07, 2023

Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao

Figure 1 for Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Figure 2 for Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Figure 3 for Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Figure 4 for Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Abstract:Large language models (LLMs) such as GPT-4 have exhibited remarkable performance in a variety of tasks, but this strong performance often comes with the high expense of using paid API services. In this paper, we are motivated to study building an LLM cascade to save the cost of using LLMs, particularly for performing reasoning (e.g., mathematical, causal) tasks. Our cascade pipeline follows the intuition that simpler questions can be addressed by a weaker but more affordable LLM, whereas only the challenging questions necessitate the stronger and more expensive LLM. To realize this decision-making, we consider the "answer consistency" of the weaker LLM as a signal of the question difficulty and propose several methods for the answer sampling and consistency checking, including one leveraging a mixture of two thought representations (i.e., Chain-of-Thought and Program-of-Thought). Through experiments on six reasoning benchmark datasets, with GPT-3.5-turbo and GPT-4 being the weaker and stronger LLMs, respectively, we demonstrate that our proposed LLM cascades can achieve performance comparable to using solely the stronger LLM but require only 40% of its cost.

Via

Access Paper or Ask Questions

Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Aug 08, 2023

Binfeng Xu, Xukun Liu, Hua Shen, Zeyu Han, Yuhan Li, Murong Yue, Zhiyuan Peng, Yuchen Liu, Ziyu Yao, Dongkuan Xu

Figure 1 for Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Figure 2 for Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Figure 3 for Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Figure 4 for Gentopia: A Collaborative Platform for Tool-Augmented LLMs

Abstract:Augmented Language Models (ALMs) empower large language models with the ability to use tools, transforming them into intelligent agents for real-world interactions. However, most existing frameworks for ALMs, to varying degrees, are deficient in the following critical features: flexible customization, collaborative democratization, and holistic evaluation. We present gentopia, an ALM framework enabling flexible customization of agents through simple configurations, seamlessly integrating various language models, task formats, prompting modules, and plugins into a unified paradigm. Furthermore, we establish gentpool, a public platform enabling the registration and sharing of user-customized agents. Agents registered in gentpool are composable such that they can be assembled together for agent collaboration, advancing the democratization of artificial intelligence. To ensure high-quality agents, gentbench, an integral component of gentpool, is designed to thoroughly evaluate user-customized agents across diverse aspects such as safety, robustness, efficiency, etc. We release gentopia on Github and will continuously move forward.

Via

Access Paper or Ask Questions