Baishakhi Ray

Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

Oct 31, 2025

AppForge: From Assistant to Independent Developer - Are GPTs Ready for Software Development?

Oct 09, 2025

Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Jun 10, 2025

CrashFixer: A crash resolution agent for the Linux kernel

Apr 29, 2025

Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

Mar 06, 2025

Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation

Feb 23, 2025

AI Software Engineer: Programming with Trust

Feb 19, 2025

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Jan 14, 2025

Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection

Dec 16, 2024

On Mitigating Code LLM Hallucinations with API Documentation

Jul 13, 2024