Benchmark for evaluating code generation LLMs across multiple programming languages
Top 95.6% on SourcePulse
MultiPL-E is a benchmark system for evaluating Large Language Models (LLMs) on code generation tasks across multiple programming languages. It translates existing Python benchmarks that rely on unit tests, such as HumanEval and MBPP, into 18 other programming languages, enabling comprehensive multilingual assessment of code LLMs.
How It Works
MultiPL-E employs a two-stage process: first, it generates code completions using LLMs, and second, it executes these completions against translated unit tests. The system's core innovation lies in its flexible translation framework, which allows users to define language-specific translators and execution scripts, facilitating the extension of benchmarks to new programming languages. This approach simplifies the creation of polyglot evaluation suites.
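The flow can be pictured with a minimal sketch. The helper names, the model.complete call, and the interpreter table below are illustrative assumptions for this sketch, not the repository's actual scripts or API:

```python
# Conceptual sketch of the two-stage pipeline (hypothetical helper names,
# not MultiPL-E's real entry points): stage 1 generates completions,
# stage 2 executes them against the translated unit tests.
import subprocess
import tempfile
from pathlib import Path

def generate_completion(model, prompt: str) -> str:
    """Stage 1: ask the LLM for a code completion (model API is assumed)."""
    return model.complete(prompt, max_tokens=512, temperature=0.2)

def run_translated_tests(program: str, tests: str, language: str) -> bool:
    """Stage 2: concatenate the completion with the translated tests and
    execute them with the target language's interpreter."""
    interpreters = {"lua": ["lua"], "py": ["python3"], "js": ["node"]}
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"problem.{language}"
        src.write_text(program + "\n" + tests)
        try:
            result = subprocess.run(
                interpreters[language] + [str(src)],
                capture_output=True, timeout=15,
            )
        except subprocess.TimeoutExpired:
            return False  # hung or too slow: count as a failed completion
    return result.returncode == 0  # tests pass iff the process exits cleanly
```

Adding a new language in this model amounts to supplying a prompt/test translator plus an execution command, which is what keeps the framework extensible.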
Quick Start & Requirements
pip3 install aiohttp numpy tqdm pytest datasets torch transformers
git clone https://github.com/nuprl/MultiPL-E
cd MultiPL-E
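A quick way to sanity-check the environment is to load one of the translated problem sets with the datasets library. The dataset id, configuration name, and field names below are assumptions drawn from the project's published Hugging Face datasets and may need adjusting to the split you want:

```python
# Smoke test after installation: pull one language split of the benchmark.
# "nuprl/MultiPL-E" and the "humaneval-lua" configuration are assumptions;
# substitute the benchmark/language pair you intend to evaluate.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-lua", split="test")
print(problems[0]["prompt"])  # translated prompt the model should complete
print(problems[0]["tests"])   # translated unit tests used for scoring
```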
Highlighted Details
Maintenance & Community
The project is authored by a large team from Northeastern University and other institutions. Contributions are welcomed, and contributors are acknowledged in the project changelog.
Licensing & Compatibility
The repository does not explicitly state a license in the README. This requires clarification for commercial use or integration into closed-source projects.
Limitations & Caveats
As noted above, the absence of a stated license is a significant caveat for adoption. In addition, while the system supports adding new languages, extending it to statically typed languages is noted as more challenging.