
Julia Hockenmaier

Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

Mar 18, 2026

Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Oct 31, 2025

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Mar 31, 2025

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Mar 25, 2025

RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Mar 17, 2025

Entailment-Preserving First-order Logic Representations in Natural Language Entailment

Feb 24, 2025

Evaluating Step-by-step Reasoning Traces: A Survey

Feb 17, 2025

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Jan 18, 2025

Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions

Aug 28, 2024

Analyzing the Performance of Large Language Models on Code Summarization

Apr 10, 2024