SWELancer-Benchmark by openai

Benchmark for evaluating LLMs in real-world software engineering tasks

Created 10 months ago

1,437 stars

Top 28.1% on SourcePulse

4 Experts Love This Project

Yiran Wu

Coauthor of AutoGen

Vincent Weisser

Cofounder of Prime Intellect

Travis Fischer

Founder of Agentic

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Project Summary

This repository provides the dataset and code for the SWE-Lancer benchmark, designed to evaluate the capabilities of frontier Large Language Models (LLMs) in real-world freelance software engineering tasks. It targets researchers and developers interested in assessing LLM performance in complex, multi-turn, and interactive coding scenarios, offering a standardized framework for reproducible evaluation.

How It Works

SWE-Lancer simulates freelance software engineering projects by presenting LLMs with tasks that require code generation, debugging, and interaction with a simulated development environment. The system utilizes a modular architecture, allowing for custom compute interfaces to integrate with various execution backends. This approach enables flexible deployment and scalability for large-scale benchmarking.

Quick Start & Requirements

Installation: Run uv sync to create the virtual environment and install dependencies, activate it (source .venv/bin/activate), then install the project packages in editable mode: uv pip install -e project/"nanoeval", uv pip install -e project/"alcatraz", and uv pip install -e project/"nanoeval_alcatraz". Alternatively, without uv: run python -m venv .venv, source .venv/bin/activate, and pip install -r requirements.txt, then install the same project packages with pip install -e.
Prerequisites: Python 3.11, Docker, OpenAI API key, and username.
Setup: Follow the Docker build instructions for Apple Silicon or Intel-based systems. Configure environment variables by copying sample.env to .env.
Running: Execute uv run python run_swelancer.py.
Documentation: SWE-Lancer Paper
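
The prerequisites listed above can be verified before a run with a small preflight check. The following is a minimal sketch and is not part of the repository: the function names check_prerequisites and gather_from_system are hypothetical, and the sketch assumes the API key is exposed through an environment variable named OPENAI_API_KEY (this summary does not state which variable names sample.env defines).

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional, Sequence


def check_prerequisites(
    python_version: Sequence[int],
    docker_path: Optional[str],
    env: Mapping[str, str],
    env_file_exists: bool,
) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    # The summary names Python 3.11 as the required version.
    if tuple(python_version[:2]) != (3, 11):
        problems.append("Python 3.11 is required")
    if docker_path is None:
        problems.append("Docker was not found on PATH")
    # Assumption: the key is read from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not env_file_exists:
        problems.append(".env is missing; copy sample.env to .env")
    return problems


def gather_from_system() -> list[str]:
    """Hypothetical helper: run the check against the current machine."""
    return check_prerequisites(
        sys.version_info,
        shutil.which("docker"),
        os.environ,
        Path(".env").exists(),
    )
```

Calling gather_from_system() from the repository root and printing the returned list gives a quick readout before executing run_swelancer.py. Keeping check_prerequisites free of side effects makes it easy to test with made-up inputs.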

Highlighted Details

Benchmark dataset for evaluating LLMs in freelance software engineering.
Supports custom compute interfaces for flexible integration with diverse infrastructure.
Includes utilities like download_videos.py for models supporting video input.
Offers a "SWE-Lancer-Lite" version with a smaller dataset for quicker evaluations.

Maintenance & Community

The repository is maintained by OpenAI, with contact information provided for questions and contributions. Updates to tasks, scaffolding, and codebase are ongoing.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The open-source agent and harness may differ from the internal scaffold used for paper performance metrics. Scaling SWE-Lancer requires implementing a custom ComputerInterface and modifying the _start_computer function.
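
To illustrate what plugging in a custom compute backend might look like, here is a hedged sketch of the general pattern. It does not reproduce the repository's actual ComputerInterface or _start_computer: the class ComputerInterfaceSketch, its methods run_command and stop, the CommandResult type, and the function start_computer_sketch are all hypothetical stand-ins, and the in-memory backend only records commands instead of executing them.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CommandResult:
    exit_code: int
    output: str


class ComputerInterfaceSketch(ABC):
    """Hypothetical stand-in for a compute interface; method names are assumed."""

    @abstractmethod
    async def run_command(self, command: str) -> CommandResult:
        """Run a shell command on the backend and return its result."""

    @abstractmethod
    async def stop(self) -> None:
        """Release whatever resources the backend holds."""


class InMemoryComputer(ComputerInterfaceSketch):
    """Toy backend: records commands and returns a canned success result.

    A real backend would dispatch to Docker, a VM, or a remote cluster.
    """

    def __init__(self) -> None:
        self.history: list[str] = []
        self.stopped = False

    async def run_command(self, command: str) -> CommandResult:
        self.history.append(command)
        return CommandResult(exit_code=0, output=f"(simulated) {command}")

    async def stop(self) -> None:
        self.stopped = True


# Registry mapping a backend name to the class that implements it.
BACKENDS = {"in_memory": InMemoryComputer}


async def start_computer_sketch(backend: str) -> ComputerInterfaceSketch:
    """Hypothetical analogue of a start function: pick a backend by name."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return BACKENDS[backend]()
```

Under these assumptions, scaling to new infrastructure amounts to writing one more subclass and registering it, while the evaluation code keeps calling the same two methods.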

Health Check

Last Commit

5 months ago

Responsiveness

Inactive

Star History

1 star in the last 30 days