CodeEval-Pro Leaderboard
HumanEval Pro and MBPP Pro evaluate LLMs on the self-invoking code generation task, which probes their reasoning ability in code generation.
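In a self-invoking problem, the model must first solve a base problem and then solve a harder, related problem that invokes its own base solution. A minimal sketch of the pattern (the problems and function names below are hypothetical illustrations, not actual benchmark items):

```python
def sort_numbers(nums):
    """Base problem: return the list sorted in ascending order."""
    return sorted(nums)

def sort_matrix_rows(matrix):
    """Self-invoking problem: sort each row of a matrix by
    reusing the base solution above."""
    return [sort_numbers(row) for row in matrix]

# The second function is only correct if the first one is, which is
# what makes self-invoking problems a test of multi-step reasoning.
assert sort_matrix_rows([[3, 1, 2], [9, 7, 8]]) == [[1, 2, 3], [7, 8, 9]]
```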
📝 Notes
1. All self-invoking samples are generated from scratch using our codebase.
2. The pass@1 scores are reported with a greedy generation strategy (see the sketch after these notes). Models are ranked by pass@1.
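Under greedy decoding there is a single sample per problem, so pass@1 reduces to the fraction of problems whose one completion passes all tests. A minimal sketch using the standard unbiased pass@k estimator (Chen et al., 2021), which collapses to that fraction when n = k = 1; the per-problem outcomes below are made up for illustration:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of which
    pass all tests. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With greedy decoding, n = k = 1, so the benchmark score is simply
# the pass rate over problems (hypothetical outcomes shown here):
outcomes = [True, False, True, True]
score = sum(pass_at_k(1, int(ok), 1) for ok in outcomes) / len(outcomes)
print(f"pass@1 = {score:.2f}")  # pass@1 = 0.75
```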
🤗 Acknowledgement and More Leaderboards
The leaderboard code is inspired by EvalPlus and CRUXEval. Thanks a lot! We also recommend the following leaderboards for measuring code LM ability on various coding tasks: the EvalPlus Leaderboard, the LiveCodeBench Leaderboard, the BigCodeBench Leaderboard, and the McEval Leaderboard.