Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Quang Hieu Pham

ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

Jun 04, 2025

Quang Hieu Pham, Thuy Duong Nguyen, Tung Pham, Anh Tuan Luu, Dat Quoc Nguyen

Abstract:The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.

* Accepted to ACL 2025 Findings

Via

Access Paper or Ask Questions

Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Oct 21, 2024

Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen

Figure 1 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 2 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 3 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 4 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Abstract:Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine model's behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrades LLMs' performance in RAG settings.

* Accepted to EMNLP 2024 Findings

Via

Access Paper or Ask Questions