multi-swe-bench by multi-swe-bench

Multilingual benchmark for LLM-powered code issue resolution

Created 7 months ago
261 stars

Top 97.5% on SourcePulse

View on GitHub
Project Summary

Multi-SWE-bench addresses the critical need for multilingual benchmarks in evaluating Large Language Models (LLMs) for real-world code issue resolution. It offers a comprehensive framework spanning seven programming languages, providing a robust dataset of 1,632 curated instances to accelerate progress in automated issue resolution and Reinforcement Learning (RL) research. This benchmark is designed for researchers and practitioners seeking to advance LLM capabilities in software engineering tasks beyond Python-centric evaluations.

How It Works

The benchmark is built on a curated dataset of 1,632 instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++, sourced from real-world code issues and validated by expert annotators. Evaluations run inside Docker containers to keep results reproducible, and can be driven through multiple agent frameworks, including Agentless, SWE-agent, and OpenHands, allowing for diverse testing scenarios. The companion Multi-SWE-RL initiative extends this with a large-scale RL dataset of 4,723 instances to foster community-driven research.
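
Since the instance files ship as JSONL (one JSON object per line), a quick way to get oriented is to inspect a downloaded file from the shell. This is a minimal sketch only: the file name and the org/repo field names below are illustrative assumptions, not taken from the official schema.

    # Minimal sketch: inspect a downloaded Multi-SWE-bench JSONL file with jq.
    # The file name and the "org"/"repo" field names are assumptions for
    # illustration; check the dataset card on Hugging Face for the real schema.
    FILE=multi_swe_bench_instances.jsonl

    # Total number of instances (one JSON object per line).
    wc -l < "$FILE"

    # Rough per-repository breakdown, assuming each record carries org/repo fields.
    jq -r '"\(.org)/\(.repo)"' "$FILE" | sort | uniq -c | sort -rn | head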

Quick Start & Requirements

Multi-SWE-bench is set up using Docker for reproducible evaluations. The primary installation involves cloning the repository (git clone git@github.com:multi-swe-bench/multi-swe-bench.git), navigating into the directory (cd multi-swe-bench), and running make install. Key requirements include Docker installation and preparation of patch files and dataset files (in JSONL format, available on Hugging Face). Optional Docker images can be downloaded using provided scripts. Evaluation is initiated via a Python command, specifying a configuration file: python -m multi_swe_bench.harness.run_evaluation --config /path/to/your/config.json. Detailed setup guides and community resources are available.
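
Gathered into one place, the documented commands look like this (Docker must already be installed; the config path is a placeholder you replace with your own file):

    # Clone and install the evaluation harness (commands from the project docs).
    git clone git@github.com:multi-swe-bench/multi-swe-bench.git
    cd multi-swe-bench
    make install

    # Run an evaluation against a prepared configuration file. The path below is
    # a placeholder; the config points the harness at your patch files and the
    # JSONL dataset files downloaded from Hugging Face.
    python -m multi_swe_bench.harness.run_evaluation --config /path/to/your/config.json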

Highlighted Details

  • Comprehensive evaluation of nine leading LLMs (e.g., GPT-4o, Claude-3.5-Sonnet) across three agent frameworks.
  • Accepted to the NeurIPS 2025 Datasets and Benchmarks track, signifying academic recognition.
  • Includes specialized versions like Multi-SWE-bench flash (300 instances) and mini (400 instances) for rapid and efficient evaluation.
  • All data, code, and container images are fully open-source, promoting community contributions and extensions.
  • A "hints" field has been added to instances to clarify variable definitions in patches.

Maintenance & Community

The project is developed by the ByteDance Seed team. It actively fosters a community through its Multi-SWE-RL initiative, which includes a Contribution Incentive Plan and encourages participation via a dedicated Discord channel for discussions and collaboration.

Licensing & Compatibility

This project is licensed under the Apache License 2.0. This permissive license allows for broad use, modification, and distribution, including in commercial and closed-source applications, with standard attribution requirements.

Limitations & Caveats

While comprehensive, the setup requires familiarity with Docker and the preparation of specific data files, which may present an initial learning curve. The large scale of the benchmark and evaluation process can also demand significant computational resources. The README does not detail known bugs or specific limitations of the current benchmark versions.

Health Check
Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 8
Issues (30d): 8
Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Lewis Tunstall (Research Engineer at Hugging Face), and 5 more.

openbench by groq
Provider-agnostic LLM evaluation infrastructure
2.6% · 592 stars
Created 2 months ago, updated 2 days ago