Sida Wang

Measuring all the noises of LLM Evals

Dec 24, 2025

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Dec 21, 2025

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Nov 12, 2024

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Jul 15, 2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Mar 12, 2024

Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

Mar 07, 2024

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Nov 18, 2022

On Continual Model Refinement in Out-of-Distribution Data Streams

May 04, 2022

InCoder: A Generative Model for Code Infilling and Synthesis

Apr 17, 2022

Deep Natural Language Processing for LinkedIn Search

Aug 16, 2021