MultiPL-E by nuprl

Benchmark for evaluating code generation LLMs across multiple programming languages

Created 3 years ago
278 stars

Top 93.3% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

MultiPL-E is a benchmark system for evaluating large language models (LLMs) on code generation tasks across multiple programming languages. It translates existing Python-based, unit-test-driven benchmarks such as HumanEval and MBPP into 18 other languages, enabling comprehensive multilingual assessment of code LLMs.
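
For a concrete sense of what a translated problem looks like, here is a minimal sketch using the Hugging Face datasets library. The `nuprl/MultiPL-E` dataset name and the `humaneval-rs` configuration (HumanEval translated to Rust) reflect how the benchmark is published on the Hub; the field names shown are assumptions based on that release.

```python
# Sketch: inspect one HumanEval problem translated to Rust.
# Assumes the benchmark is published on the Hugging Face Hub as
# "nuprl/MultiPL-E" with per-language configs such as "humaneval-rs",
# and that each record carries the fields shown below.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-rs", split="test")

example = problems[0]
print(example["name"])         # problem identifier
print(example["prompt"])       # Rust function signature plus doc comment
print(example["tests"])        # translated unit tests, appended after a completion
print(example["stop_tokens"])  # strings marking where generation should stop
```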

How It Works

MultiPL-E employs a two-stage process: first, it generates code completions using LLMs, and second, it executes these completions against translated unit tests. The system's core innovation lies in its flexible translation framework, which allows users to define language-specific translators and execution scripts, facilitating the extension of benchmarks to new programming languages. This approach simplifies the creation of polyglot evaluation suites.
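
The translator interface sketched below is hypothetical (the repository's actual translator classes and method names may differ), but it illustrates the idea behind the framework: each target language supplies rules for rendering Python test values and assertions as source text in that language, and a shared driver reuses those rules to rewrite every benchmark problem.

```python
# Hypothetical sketch of a per-language translator, illustrating how
# Python-based unit tests can be rewritten for a new target language.
# Class and method names are assumptions, not MultiPL-E's actual API.
from typing import List


class LuaTranslator:
    """Maps Python test values and assertions into Lua source text."""

    def gen_literal(self, value) -> str:
        # Render a Python value as a Lua literal.
        if isinstance(value, bool):  # check bool before int: bool subclasses int
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return repr(value)
        if isinstance(value, str):
            return '"%s"' % value.replace('"', '\\"')
        if isinstance(value, list):
            return "{" + ", ".join(self.gen_literal(v) for v in value) + "}"
        raise ValueError(f"unsupported literal: {value!r}")

    def gen_call(self, func: str, args: List) -> str:
        return f"{func}({', '.join(self.gen_literal(a) for a in args)})"

    def gen_assert(self, call: str, expected) -> str:
        # One translated unit test. Scalar comparison only: a real translator
        # also needs deep equality for tables and other structured values.
        return f"assert({call} == {self.gen_literal(expected)})"


translator = LuaTranslator()
print(translator.gen_assert(translator.gen_call("add", [2, 3]), 5))
# -> assert(add(2, 3) == 5)
```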

Quick Start & Requirements

  • Install: pip3 install aiohttp numpy tqdm pytest datasets torch transformers
  • Prerequisites: Python 3.8+, Docker or Podman.
  • Setup: Clone the repository (git clone https://github.com/nuprl/MultiPL-E), cd MultiPL-E.
  • Resources: Generation requires a GPU (e.g., ~13 GB of VRAM for SantaCoder at batch size 20; see the generation sketch after this list). Execution requires a containerized environment or manually installed toolchains for each target language.
  • Docs: BigCode Code Generation LM Harness
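
As a rough sketch of the generation stage, the following uses transformers to sample one completion from SantaCoder (`bigcode/santacoder` on the Hub) for a MultiPL-E prompt and truncates it at the dataset's stop tokens. The real pipeline drives this through the repository's own scripts with batching and multiple samples per problem, so treat the model id, sampling settings, and truncation logic here as illustrative assumptions.

```python
# Sketch of the generation stage: one completion for one translated prompt.
# Model id and sampling settings are illustrative; MultiPL-E's own scripts
# handle batching, multiple samples per problem, and output formatting.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

problems = load_dataset("nuprl/MultiPL-E", "humaneval-lua", split="test")
problem = problems[0]

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer(problem["prompt"], return_tensors="pt").to("cuda")
output = model.generate(
    **inputs, do_sample=True, temperature=0.2, max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Truncate at the first stop token so only the function body survives;
# the execution stage later appends problem["tests"] and runs the result.
for stop in problem["stop_tokens"]:
    completion = completion.split(stop)[0]
print(problem["prompt"] + completion)
```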

Highlighted Details

  • Translates HumanEval and MBPP benchmarks to 18 languages.
  • Supports evaluation of various code generation LLMs (e.g., SantaCoder).
  • Provides a containerized execution environment with pre-installed toolchains.
  • Detailed instructions for adding support for new languages and benchmarks.

Maintenance & Community

The project is authored by a large team from Northeastern University and other institutions. Contributions are welcome, and contributors are acknowledged in the project changelog.

Licensing & Compatibility

The README does not explicitly state a license. Licensing should be clarified before commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not specify a license, which is a significant caveat for adoption. While the system supports adding new languages, the README notes that doing so is more challenging for statically typed languages.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

Explore Similar Projects

SWE-bench by SWE-bench

Benchmark for evaluating LLMs on real-world GitHub issues

Created 2 years ago, updated 3 days ago
4k stars
Top 0.7% on SourcePulse