CodeEval-Pro Leaderboard
HumanEval Pro and MBPP Pro evaluate LLMs on the self-invoking code generation task, which probes their reasoning ability in code generation.
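In a self-invoking problem, the model must first solve a base problem and then solve a harder, related problem that invokes its own base solution. A minimal sketch of the pattern (the problems and function names below are hypothetical illustrations, not actual benchmark items):

```python
def sort_numbers(nums):
    """Base problem: return the list sorted in ascending order."""
    return sorted(nums)

def sort_matrix_rows(matrix):
    """Self-invoking problem: sort each row of a matrix by
    reusing the base solution above."""
    return [sort_numbers(row) for row in matrix]

# The second function is only correct if the first one is, which is
# what makes self-invoking problems a test of multi-step reasoning.
assert sort_matrix_rows([[3, 1, 2], [9, 7, 8]]) == [[1, 2, 3], [7, 8, 9]]
```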
📝 Notes
1. All self-invoking samples are generated from scratch using our codebase.
2. The pass@1 scores are reported with a greedy generation strategy (see the sketch after these notes). Models are ranked by pass@1.
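Under greedy decoding there is a single sample per problem, so pass@1 reduces to the fraction of problems whose one completion passes all tests. A minimal sketch using the standard unbiased pass@k estimator (Chen et al., 2021), which collapses to that fraction when n = k = 1; the per-problem outcomes below are made up for illustration:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of which
    pass all tests. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With greedy decoding, n = k = 1, so the benchmark score is simply
# the pass rate over problems (hypothetical outcomes shown here):
outcomes = [True, False, True, True]
score = sum(pass_at_k(1, int(ok), 1) for ok in outcomes) / len(outcomes)
print(f"pass@1 = {score:.2f}")  # pass@1 = 0.75
```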
🤗 Acknowledgement and More Leaderboards
The leaderboard code is inspired by EvalPlus and CRUXEval. Thanks a lot! We also recommend the following leaderboards for measuring code LM ability on various coding tasks: the EvalPlus Leaderboard, the LiveCodeBench Leaderboard, the BigCodeBench Leaderboard, and the McEval Leaderboard.