Multilingual benchmark for LLM-powered code issue resolution
Top 97.5% on SourcePulse
Multi-SWE-bench addresses the critical need for multilingual benchmarks in evaluating Large Language Models (LLMs) for real-world code issue resolution. It offers a comprehensive framework spanning seven programming languages, providing a robust dataset of 1,632 curated instances to accelerate progress in automated issue resolution and Reinforcement Learning (RL) research. This benchmark is designed for researchers and practitioners seeking to advance LLM capabilities in software engineering tasks beyond Python-centric evaluations.
How It Works
The project leverages a meticulously curated dataset of 1,632 instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++, sourced from real-world code issues and validated by expert annotators. For reproducible evaluations, Multi-SWE-bench utilizes Docker containers. The framework supports multiple agent execution environments, including Agentless, SWE-agent, and OpenHands, allowing for diverse testing scenarios. The Multi-SWE-RL initiative further expands this by providing a large-scale RL dataset with 4,723 instances to foster community-driven research.
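Since dataset files are distributed as JSONL (one JSON object per line), loading instances is straightforward. The snippet below is a minimal sketch; the field names (`org`, `repo`, `number`, `language`) are illustrative assumptions, not the benchmark's actual schema:

```python
import json

# Hypothetical example: Multi-SWE-bench instances ship as JSONL, one JSON
# object per line. The field names below are assumptions for illustration.
sample_jsonl = """\
{"org": "example-org", "repo": "example-repo", "number": 42, "language": "rust"}
{"org": "example-org", "repo": "other-repo", "number": 7, "language": "go"}
"""

def load_instances(jsonl_text):
    """Parse one benchmark instance per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

instances = load_instances(sample_jsonl)
print(len(instances))            # 2
print(instances[0]["language"])  # rust
```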
Quick Start & Requirements
Multi-SWE-bench is set up using Docker for reproducible evaluations. The primary installation steps are:

```shell
git clone git@github.com:multi-swe-bench/multi-swe-bench.git
cd multi-swe-bench
make install
```

Key requirements include a working Docker installation and preparation of patch files and dataset files (in JSONL format, available on Hugging Face). Optional Docker images can be downloaded using provided scripts. Evaluation is initiated via a Python command that specifies a configuration file:

```shell
python -m multi_swe_bench.harness.run_evaluation --config /path/to/your/config.json
```

Detailed setup guides and community resources are available.
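The exact schema of the evaluation config is defined by the project's documentation. Purely as an illustration of the kind of settings such a file typically carries, a minimal config might look like the following (every field name here is an assumption, not the actual schema):

```json
{
    "dataset_files": ["/data/multi_swe_bench/rust.jsonl"],
    "patch_files": ["/data/patches/my_agent_rust.jsonl"],
    "workdir": "/tmp/msb_eval",
    "max_workers": 4,
    "log_level": "INFO"
}
```

Consult the repository's setup guide for the real required and optional fields before running an evaluation.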
Maintenance & Community
The project is developed by the ByteDance Seed team. It actively fosters a community through its Multi-SWE-RL initiative, which includes a Contribution Incentive Plan and encourages participation via a dedicated Discord channel for discussions and collaboration.
Licensing & Compatibility
This project is licensed under the Apache License 2.0. This permissive license allows for broad use, modification, and distribution, including in commercial and closed-source applications, with standard attribution requirements.
Limitations & Caveats
The setup requires familiarity with Docker and the preparation of specific data files, which may present an initial learning curve. The scale of the benchmark means a full evaluation run can demand significant computational resources (container builds and test executions across seven languages). The README does not document known bugs or specific limitations of the current benchmark versions.