We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks, designed to evaluate LLMs on the self-invoking code generation task. Self-invoking code generation is a new task that evaluates the progressive reasoning and problem-solving capabilities of LLMs: models are presented with a base problem and a related, more complex problem, and must solve the base problem and then utilize its solution to address the more complex one.
Concretely, each benchmark instance pairs an original HumanEval (or MBPP) problem with a harder, related problem whose solution must invoke the solution to the base problem, together with test cases for the new problem.
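For illustration, a self-invoking problem pair might look like the following minimal sketch (a made-up example, not an actual benchmark instance):

```python
# Hypothetical example of a base problem and its self-invoking extension.
# (Illustrative only; not an actual HumanEval Pro / MBPP Pro instance.)

def sort_numbers(numbers):
    """Base problem: return the list of numbers sorted in ascending order."""
    return sorted(numbers)

def sort_each_row(matrix):
    """Self-invoking problem: sort every row of a matrix by reusing the
    solution to the base problem."""
    return [sort_numbers(row) for row in matrix]

# Test cases check both the base and the self-invoking solution.
assert sort_numbers([3, 1, 2]) == [1, 2, 3]
assert sort_each_row([[3, 1], [9, 4, 7]]) == [[1, 3], [4, 7, 9]]
```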
Instruction-tuned models show only marginal improvements over their base models on self-invoking code generation. When we plot HumanEval (or MBPP) scores against HumanEval Pro (or MBPP Pro) scores, the orange dot (base model) always lies to the upper left of the blue dot (instruction-tuned model). In contrast, when comparing HumanEval (or MBPP) with HumanEval+ (or MBPP+), the blue dot always lies above the orange dot (the points even fall roughly on a line for HumanEval vs. HumanEval+). Overall, this suggests that while instruction tuning significantly improves performance on simpler benchmarks like HumanEval(+) and MBPP(+), its effectiveness diminishes on the more complex self-invoking code generation task.
Although some SoTA LLMs such as Qwen2.5-Coder-32B-instruct solve around 90% of the base problems on the original HumanEval and MBPP benchmarks, they still fail on over 25% of problems in the more challenging HumanEval Pro and MBPP Pro, which require self-invoking code generation (as shown in the top right of each subfigure in Figure 5). This suggests that the drop in scores on HumanEval Pro and MBPP Pro is largely due to lower accuracy in generating self-invoking code compared with direct code generation.
The instruction-tuned models typically have a significantly higher number of (Passed, Passed) instances than their base models. However, for samples that pass the base problems but fail on HumanEval Pro and MBPP Pro, i.e., (Failed, Passed), the instruction-tuned models show no notable improvement.
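The (Passed, Passed) and (Failed, Passed) counts can be reproduced by tallying per-problem outcomes; the sketch below assumes hypothetical dictionaries mapping task IDs to pass/fail booleans, with tuples ordered as (Pro result, base result).

```python
from collections import Counter

def confusion_matrix(pro_results, base_results):
    """Tally (Pro outcome, base outcome) pairs; e.g. ('Failed', 'Passed')
    counts problems whose base version passes but whose self-invoking
    version fails."""
    label = lambda ok: "Passed" if ok else "Failed"
    return Counter(
        (label(pro_results[pid]), label(base_results[pid]))
        for pid in pro_results
    )

# Hypothetical per-problem pass/fail results keyed by task ID.
base = {"task_0": True, "task_1": True, "task_2": False}
pro  = {"task_0": True, "task_1": False, "task_2": False}
print(confusion_matrix(pro, base))
# Counter({('Passed', 'Passed'): 1, ('Failed', 'Passed'): 1, ('Failed', 'Failed'): 1})
```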
Using CoT leads to some improvements: after applying CoT, the pass@1 of the selected models on HumanEval Pro improves significantly. Notably, the accuracy of GPT-4o increases from 75.0% to 78.0%. On MBPP Pro, although the models do not show a significant improvement, they maintain their original performance level, indicating that CoT can enhance the accuracy of model-generated code to a notable degree.
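For reference, a CoT prompt in this setting simply asks the model to reason about the base problem before composing the self-invoking solution. The template below is a hypothetical sketch of such a prompt, not the exact wording used in the experiments.

```python
# Illustrative CoT-style prompt for self-invoking code generation.
# The wording is a hypothetical sketch, not the exact prompt used in the paper.
COT_TEMPLATE = """You are given a base problem and a related, more complex problem.

Base problem:
{base_problem}

Self-invoking problem:
{new_problem}

Think step by step: first solve the base problem, then explain how its
solution can be reused for the self-invoking problem, and finally write
the complete Python code for both functions."""

def build_cot_prompt(base_problem: str, new_problem: str) -> str:
    """Fill the template with a concrete problem pair."""
    return COT_TEMPLATE.format(base_problem=base_problem, new_problem=new_problem)
```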
CoT also helps Code LLMs generate more reliable code when reasoning across multiple related problems. The number of AssertionErrors decreases from 28 to 24, indicating that CoT prompting enables the model to generate code that passes test cases more often. The number of NameErrors also decreases, indicating that CoT prompting helps the model produce more self-contained code snippets and reduces the use of undefined variables. These findings highlight that CoT prompting can help LLMs generate more accurate and reliable solutions for the self-invoking code generation task.
First, AssertionErrors constitute the primary source of errors for all models on the self-invoking code generation task, which suggests that the majority of errors are still due to failing test cases. Second, NameErrors, which are typically caused by undefined variables or functions, contribute significantly to the error rate. This suggests that, despite the function information being provided in the prompt, models often fail to generate the correct function header, indicating issues with understanding or correctly utilizing the provided information. Finally, TypeErrors and ValueErrors account for a relatively small proportion of errors, showing that LLMs still have some deficiencies in handling variable types and usage when generating self-invoking code.
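This error breakdown can be obtained by executing each generated solution against its test cases and recording the raised exception type. The snippet below is a minimal sketch that omits the sandboxing and timeouts a real evaluation harness would need; the sample data is made up.

```python
from collections import Counter

def classify_errors(samples):
    """Execute each generated solution together with its test cases and
    count failures by exception type (AssertionError, NameError, ...).

    `samples` is a list of (solution_code, test_code) string pairs; a real
    evaluation harness would add sandboxing and timeouts.
    """
    counts = Counter()
    for solution, tests in samples:
        env = {}
        try:
            exec(solution, env)   # define the generated functions
            exec(tests, env)      # run the assert-based test cases
        except Exception as e:
            counts[type(e).__name__] += 1
        else:
            counts["Passed"] += 1
    return counts

# Hypothetical samples: one passes, one fails an assert, one uses an undefined name.
samples = [
    ("def add(a, b):\n    return a + b", "assert add(1, 2) == 3"),
    ("def add(a, b):\n    return a - b", "assert add(1, 2) == 3"),
    ("def add(a, b):\n    return plus(a, b)", "assert add(1, 2) == 3"),
]
print(classify_errors(samples))  # Counter({'Passed': 1, 'AssertionError': 1, 'NameError': 1})
```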
Overall, we believe that HumanEval Pro and MBPP Pro provide a complementary perspective for evaluating the code generation and reasoning abilities of LLMs, and could help creators of future LLMs understand their models from a different perspective!
@article{yu2024humaneval,
  title={HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation},
  author={Yu, Zhaojian and Zhao, Yilun and Cohan, Arman and Zhang, Xiao-Ping},
  journal={arXiv preprint arXiv:2412.21199},
  year={2024}
}