Federico Solda

Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

May 19, 2025

Ming Ding, Rasmus Kyng, Federico Solda, Weixuan Yuan

Abstract: As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymous GPT-generated solutions to take-home exams were graded by teaching assistants unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
