GPT-4


InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Nov 03, 2025
Via arXiv

Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

Oct 30, 2025
Via arXiv

Leveraging Hierarchical Organization for Medical Multi-document Summarization

Oct 27, 2025
Via arXiv

A Comprehensive Dataset for Human vs. AI Generated Text Detection

Oct 26, 2025
Via arXiv

A Coherence-Based Measure of AGI

Oct 23, 2025
Via arXiv

Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization

Oct 22, 2025
Via arXiv

Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning

Oct 06, 2025
Via arXiv

SLM-MUX: Orchestrating Small Language Models for Reasoning

Oct 06, 2025
Via arXiv

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Oct 02, 2025
Via arXiv

PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning

Sep 26, 2025
Via arXiv