Personalized Benchmarking: Evaluating LLMs by Individual Preferences

Apr 21, 2026

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Apr 21, 2026

ltzGLUE: Luxembourgish General Language Understanding Evaluation

Apr 20, 2026

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Apr 20, 2026

Retrieval-Augmented Multimodal Model for Fake News Detection

Apr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Apr 20, 2026

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

Apr 20, 2026

Physics-Informed Neural Networks: A Didactic Derivation of the Complete Training Cycle

Apr 20, 2026

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Apr 20, 2026

Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation

Apr 20, 2026