Benchmark for evaluating code generation LLMs across multiple programming languages
Top 95.6% on SourcePulse
MultiPL-E is a benchmark system for evaluating Large Language Models (LLMs) on code generation tasks across multiple programming languages. It translates existing Python benchmarks that rely on unit tests, such as HumanEval and MBPP, into 18 other programming languages, enabling comprehensive multilingual assessment of code LLMs.
How It Works
MultiPL-E employs a two-stage process: first, it generates code completions using LLMs, and second, it executes these completions against translated unit tests. The system's core innovation lies in its flexible translation framework, which allows users to define language-specific translators and execution scripts, facilitating the extension of benchmarks to new programming languages. This approach simplifies the creation of polyglot evaluation suites.
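The flow can be pictured with a minimal sketch. The helper names, the model.complete call, and the interpreter table below are illustrative assumptions for this sketch, not the repository's actual scripts or API:

```python
# Conceptual sketch of the two-stage pipeline (hypothetical helper names,
# not MultiPL-E's real entry points): stage 1 generates completions,
# stage 2 executes them against the translated unit tests.
import subprocess
import tempfile
from pathlib import Path

def generate_completion(model, prompt: str) -> str:
    """Stage 1: ask the LLM for a code completion (model API is assumed)."""
    return model.complete(prompt, max_tokens=512, temperature=0.2)

def run_translated_tests(program: str, tests: str, language: str) -> bool:
    """Stage 2: concatenate the completion with the translated tests and
    execute them with the target language's interpreter."""
    interpreters = {"lua": ["lua"], "py": ["python3"], "js": ["node"]}
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"problem.{language}"
        src.write_text(program + "\n" + tests)
        try:
            result = subprocess.run(
                interpreters[language] + [str(src)],
                capture_output=True, timeout=15,
            )
        except subprocess.TimeoutExpired:
            return False  # hung or too slow: count as a failed completion
    return result.returncode == 0  # tests pass iff the process exits cleanly
```

Adding a new language in this model amounts to supplying a prompt/test translator plus an execution command, which is what keeps the framework extensible.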
Quick Start & Requirements
pip3 install aiohttp numpy tqdm pytest datasets torch transformers
git clone https://github.com/nuprl/MultiPL-E
cd MultiPL-E
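A quick way to sanity-check the environment is to load one of the translated problem sets with the datasets library. The dataset id, configuration name, and field names below are assumptions drawn from the project's published Hugging Face datasets and may need adjusting to the split you want:

```python
# Smoke test after installation: pull one language split of the benchmark.
# "nuprl/MultiPL-E" and the "humaneval-lua" configuration are assumptions;
# substitute the benchmark/language pair you intend to evaluate.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-lua", split="test")
print(problems[0]["prompt"])  # translated prompt the model should complete
print(problems[0]["tests"])   # translated unit tests used for scoring
```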
Highlighted Details
Maintenance & Community
The project is authored by a large team from Northeastern University and other institutions. Contributions are welcomed, and contributors are acknowledged in the project changelog.
Licensing & Compatibility
The repository does not explicitly state a license in the README. This requires clarification for commercial use or integration into closed-source projects.
Limitations & Caveats
As noted above, the absence of a stated license is a significant caveat for adoption. In addition, while the system supports adding new languages, extending it to statically typed languages is noted as more challenging.