HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task

1Tsinghua University 2Yale University

Introduction

We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks designed to evaluate LLMs on the self-invoking code generation task. Self-invoking code generation is a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then use its solution to address the more complex one.
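To make the task concrete, here is a minimal illustrative example of a base problem and its self-invoking counterpart (the functions below are hypothetical and not drawn from HumanEval Pro): the model must first solve the base problem and then reuse that solution inside the more complex one.

```python
# Hypothetical base problem: sum the numbers in a list.
def sum_list(numbers):
    return sum(numbers)

# Hypothetical self-invoking problem: sum all numbers in a list of lists,
# reusing the base solution instead of re-deriving it.
def sum_nested(list_of_lists):
    return sum(sum_list(inner) for inner in list_of_lists)

# Test cases in the assert style used by HumanEval and MBPP.
assert sum_list([1, 2, 3]) == 6
assert sum_nested([[1, 2], [3, 4]]) == 10
```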

Benchmark Construction


Figure 3: Overview of the benchmark construction process.

The benchmark was constructed in three steps:

  1. Self-invoking Problem Generation: We use DeepSeek-V2.5 to generate the self-invoking problems, along with their candidate solutions and test inputs.
  2. Solution Generation: We execute the generated solutions with the test inputs in a controlled Python environment to obtain ground-truth outputs.
  3. Test Case Generation: We employ an iterative method involving Python execution checks and manual review to ensure that all test cases pass successfully. The final execution results are then used to construct complete test cases with assert statements (see the sketch below).
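As a rough illustration of how steps 2 and 3 fit together, the sketch below executes a candidate solution on test inputs and turns the resulting ground-truth outputs into assert-based test cases. The helper name `build_test_cases` and the bare `exec()` stand-in for a controlled environment are assumptions for illustration, not the actual pipeline.

```python
# Sketch: turn ground-truth execution results into assert test cases.
candidate_solution = """
def count_vowels(text):
    return sum(ch in "aeiou" for ch in text.lower())
"""

test_inputs = ["Hello", "xyz", "Programming"]

def build_test_cases(solution_code, entry_point, inputs):
    """Execute the candidate solution on each input and emit assert statements."""
    namespace = {}
    exec(solution_code, namespace)           # run the candidate solution
    func = namespace[entry_point]
    cases = []
    for arg in inputs:
        output = func(arg)                   # ground-truth output via execution
        cases.append(f"assert {entry_point}({arg!r}) == {output!r}")
    return cases

for case in build_test_cases(candidate_solution, "count_vowels", test_inputs):
    print(case)
# assert count_vowels('Hello') == 2
# assert count_vowels('xyz') == 0
# assert count_vowels('Programming') == 3
```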

Results

  1. Most LLMs show an absolute performance drop of 10% to 15% on self-invoking code generation benchmarks.
  2. Large open-source LLMs achieve performance comparable to proprietary LLMs on self-invoking benchmarks.
  3. Most instruction-tuned models show smaller improvements on self-invoking code generation benchmarks (e.g., HumanEval Pro) than on traditional benchmarks (e.g., HumanEval). For instance, Qwen2.5-Coder-32B-instruct gains 26.8% absolute improvement on HumanEval over Qwen2.5-Coder-32B-base (from 65.9% to 92.7%), but only 8.5% on HumanEval Pro (from 61.6% to 70.1%).

Table 1: Main results of different models on HumanEval Pro and MBPP Pro.

Insights

The instruction-tuned models demonstrate only marginal improvements over the base models on self-invoking code generation. When we plot HumanEval (or MBPP) scores against HumanEval Pro (or MBPP Pro) scores, the orange dot (base model) always lies to the upper left of the blue dot (instruction-tuned model). In contrast, when HumanEval (or MBPP) is compared with HumanEval+ (or MBPP+), the blue dot always lies above the orange dot (the points even fall roughly on a line for HumanEval vs. HumanEval+). Overall, this suggests that while instruction-based fine-tuning significantly improves performance on simpler benchmarks such as HumanEval(+) and MBPP(+), its effectiveness diminishes on more complex self-invoking code generation tasks.


Figure 4: HumanEval (or MBPP) scores plotted against HumanEval Pro and MBPP Pro results, and against HumanEval+ and MBPP+ results.

Although some SoTA LLMs such as Qwen2.5-Coder-32B-instruct successfully solve 90% of the base problems on the original HumanEval and MBPP benchmarks, they still fail on over 25% of the problems in the more challenging HumanEval Pro and MBPP Pro benchmarks with self-invoking code generation (as shown in the top right of each subfigure in Figure 5). This suggests that the drop in the models' scores on HumanEval Pro and MBPP Pro is largely due to their lower accuracy at generating self-invoking code compared to direct code generation.


Figure 5: The confusion matrix of different models.

The instruction-tuned model typically has a significantly higher number of (Passed, Passed) instances than the base model. However, for samples that pass the base problems but fail on HumanEval Pro and MBPP Pro, i.e., (Failed, Passed), the instruction-tuned model does not show a notable improvement.
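As a rough sketch of how these confusion-matrix cells can be tallied, the snippet below assumes hypothetical per-problem pass/fail records for the original benchmark and its Pro counterpart; the (Pro result, base result) labeling follows the convention used above.

```python
from collections import Counter

# Hypothetical per-problem records: (passed base problem, passed Pro version).
results = [
    (True, True),    # solved both                        -> (Passed, Passed)
    (True, False),   # solved base, failed self-invoking  -> (Failed, Passed)
    (False, False),  # failed both                        -> (Failed, Failed)
]

def label(ok):
    return "Passed" if ok else "Failed"

def confusion_matrix(records):
    """Tally (Pro result, base result) cells as in Figure 5."""
    return Counter((label(pro), label(base)) for base, pro in records)

print(confusion_matrix(results))
# Counter({('Passed', 'Passed'): 1, ('Failed', 'Passed'): 1, ('Failed', 'Failed'): 1})
```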



Chain of Thought

Using CoT leads to some improvement: after applying CoT, the pass@1 of the selected models on HumanEval Pro increases noticeably. Notably, the accuracy of GPT-4o rises from 75.0% to 78.0%. On MBPP Pro, although the model does not show a significant improvement, it still maintains its original performance level, indicating that CoT can enhance the accuracy of model-generated code to a notable degree.
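The exact CoT prompt is not reproduced here; the sketch below shows one way such a prompt wrapper might look for a self-invoking problem, with the wording being an assumption rather than the actual template.

```python
def cot_prompt(base_problem, self_invoking_problem):
    """Wrap a self-invoking problem in a simple chain-of-thought instruction.
    The phrasing is illustrative, not the paper's exact template."""
    return (
        "You are given a base problem and a more complex problem that builds on it.\n\n"
        f"Base problem:\n{base_problem}\n\n"
        f"Self-invoking problem:\n{self_invoking_problem}\n\n"
        "First reason step by step: solve the base problem, then explain how its "
        "solution can be reused in the self-invoking problem. "
        "Finally, output the complete Python code for both functions."
    )
```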

Table 2: Results with and without CoT on self-invoking code generation benchmarks.

CoT can help code LLMs generate more reliable code when coordinating solutions across multiple related code problems. The number of AssertionErrors decreases from 28 to 24, indicating that CoT prompting enables the model to generate code that passes test cases more frequently. The number of NameErrors also decreases, indicating that CoT prompting helps the model produce more self-contained code snippets and reduces the use of undefined variables. These findings highlight that CoT prompting can help LLMs generate more accurate and reliable solutions on the self-invoking code generation task.


Figure 7: Error types of GPT-4o with and without CoT reasoning on HumanEval Pro.

Error Types

First, AssertionErrors constitute the primary source of errors for all models on the self-invoking code generation task, which suggests that the majority of failures are still due to failing test cases. Second, NameErrors, which are typically caused by undefined variables or functions, contribute significantly to the error rate. This suggests that despite the function information being provided in the prompt, many generations still fail to produce the correct function header, which may indicate that the LLMs have trouble understanding or correctly utilizing the provided information. Finally, TypeErrors and ValueErrors account for a relatively small proportion of errors, showing that LLMs still have some deficiencies in handling variable types and usage when generating self-invoking code.
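For reference, a minimal sketch of how such an error breakdown can be collected: run each generated solution against its assert-based tests and record the raised exception type. The harness below is an illustrative assumption, not the evaluation code used in the paper.

```python
from collections import Counter

def classify_errors(samples):
    """Run each generated solution against its assert-based tests and
    record the exception type (AssertionError, NameError, TypeError, ...)."""
    counts = Counter()
    for code, tests in samples:
        try:
            namespace = {}
            exec(code + "\n" + tests, namespace)
            counts["Passed"] += 1
        except Exception as exc:
            counts[type(exc).__name__] += 1
    return counts

# Hypothetical generated samples paired with their test cases.
samples = [
    ("def add(a, b):\n    return a - b", "assert add(1, 2) == 3"),          # wrong logic
    ("def add(a, b):\n    return helper(a, b)", "assert add(1, 2) == 3"),   # undefined helper
]
print(classify_errors(samples))  # Counter({'AssertionError': 1, 'NameError': 1})
```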


Figure 8: Statistics of error type across different LLMs on HumanEval Pro and MBPP Pro.

Overall, we believe that HumanEval Pro and MBPP Pro provide a complementary perspective for evaluating the code and reasoning abilities of LLMs, and could help creators of future LLMs understand their models from a different perspective!

BibTeX

@article{yu2024humaneval,
  title={HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation},
  author={Yu, Zhaojian and Zhao, Yilun and Cohan, Arman and Zhang, Xiao-Ping},
  journal={arXiv preprint arXiv:2412.21199},
  year={2024}
}