Benchmark for evaluating LLMs in real-world software engineering tasks
This repository provides the dataset and code for the SWE-Lancer benchmark, designed to evaluate the capabilities of frontier Large Language Models (LLMs) in real-world freelance software engineering tasks. It targets researchers and developers interested in assessing LLM performance in complex, multi-turn, and interactive coding scenarios, offering a standardized framework for reproducible evaluation.
How It Works
SWE-Lancer simulates freelance software engineering projects by presenting LLMs with tasks that require code generation, debugging, and interaction with a simulated development environment. The system utilizes a modular architecture, allowing for custom compute interfaces to integrate with various execution backends. This approach enables flexible deployment and scalability for large-scale benchmarking.
Quick Start & Requirements
Use uv sync for package management and activate the virtual environment (source .venv/bin/activate), then install project packages: uv pip install -e project/"nanoeval", uv pip install -e project/"alcatraz", uv pip install -e project/"nanoeval_alcatraz". Alternatively, use python -m venv .venv, source .venv/bin/activate, pip install -r requirements.txt, and then install project packages with pip install -e.
Copy sample.env to .env.
Run the benchmark with uv run python run_swelancer.py.
Highlighted Details
download_videos.py is provided for models supporting video input.
Maintenance & Community
The repository is maintained by OpenAI, with contact information provided for questions and contributions. Updates to tasks, scaffolding, and codebase are ongoing.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The open-source agent and harness may differ from the internal scaffold used for paper performance metrics. Scaling SWE-Lancer requires implementing a custom ComputerInterface and modifying the _start_computer function.
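To make the scaling caveat concrete, here is a minimal Python sketch of the general pattern: an interface for "a computer that runs commands", one backend implementing it, and a start function that selects a backend and guarantees cleanup. Every name here (CommandResult, ComputerLike, LocalComputer, start_computer) is a hypothetical stand-in chosen for illustration. The repository's actual ComputerInterface and _start_computer may have different methods and signatures, so treat this as a picture of the idea rather than of the project's API.

```python
import subprocess
import sys
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class CommandResult:
    """Outcome of one command (hypothetical type, not from the repository)."""
    exit_code: int
    output: str


class ComputerLike(Protocol):
    """Hypothetical stand-in for the role a compute interface plays."""

    def run(self, argv: list[str]) -> CommandResult: ...

    def stop(self) -> None: ...


class LocalComputer:
    """Runs commands as local subprocesses.

    A scaled-up backend would implement the same two methods against
    a remote VM, container service, or cluster instead.
    """

    def __init__(self) -> None:
        self.stopped = False

    def run(self, argv: list[str]) -> CommandResult:
        if self.stopped:
            raise RuntimeError("computer already stopped")
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        return CommandResult(proc.returncode, proc.stdout + proc.stderr)

    def stop(self) -> None:
        self.stopped = True


@contextmanager
def start_computer(backend: str = "local") -> Iterator[ComputerLike]:
    """Pick a backend by name and always release it afterwards.

    Hypothetical analogue of a start function; only "local" exists here.
    """
    if backend != "local":
        raise ValueError(f"unknown backend: {backend!r}")
    computer = LocalComputer()
    try:
        yield computer
    finally:
        computer.stop()


if __name__ == "__main__":
    with start_computer("local") as computer:
        result = computer.run([sys.executable, "-c", "print('hello from task')"])
        print(result.exit_code, result.output.strip())
```

In this sketch, adding an execution backend means writing one more class with run and stop and adding a branch in start_computer. The calling code never needs to know which backend it received.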