Yujun Zhou

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Oct 02, 2025
Via arXiv

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Sep 18, 2025
Via arXiv

Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Jun 05, 2025
Via arXiv

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

May 29, 2025
Via arXiv

AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking

May 22, 2025
Via arXiv

Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis

Feb 19, 2025
Via arXiv

Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations?

Oct 30, 2024
Via arXiv

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Oct 18, 2024
Via arXiv

Defending Jailbreak Prompts via In-Context Adversarial Game

Feb 20, 2024
Via arXiv

SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark

Feb 06, 2024
Via arXiv