SWELancer-Benchmark by openai

Benchmark for evaluating LLMs in real-world software engineering tasks

created 5 months ago · 1,434 stars · Top 29.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the dataset and code for the SWE-Lancer benchmark, designed to evaluate the capabilities of frontier Large Language Models (LLMs) in real-world freelance software engineering tasks. It targets researchers and developers interested in assessing LLM performance in complex, multi-turn, and interactive coding scenarios, offering a standardized framework for reproducible evaluation.

How It Works

SWE-Lancer simulates freelance software engineering projects by presenting LLMs with tasks that require code generation, debugging, and interaction with a simulated development environment. The system uses a modular architecture in which custom compute interfaces integrate with different execution backends, enabling flexible deployment and scalable large-scale benchmarking.

Quick Start & Requirements

  • Installation: With uv, run uv sync, activate the virtual environment (source .venv/bin/activate), then install the project packages: uv pip install -e project/nanoeval, uv pip install -e project/alcatraz, and uv pip install -e project/nanoeval_alcatraz. Alternatively, create a virtual environment with python -m venv .venv, activate it (source .venv/bin/activate), run pip install -r requirements.txt, and then pip install -e each of the same three project packages.
  • Prerequisites: Python 3.11, Docker, an OpenAI API key, and a username.
  • Setup: Follow the Docker build instructions for Apple Silicon or Intel-based systems. Configure environment variables by copying sample.env to .env.
  • Running: Execute uv run python run_swelancer.py; a small pre-flight environment check is sketched after this list.
  • Documentation: SWE-Lancer Paper
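
Before a full run, it can help to confirm the environment is wired up. The snippet below is a minimal pre-flight check, not part of the repository: it assumes the harness reads an OPENAI_API_KEY variable (implied by the prerequisites above); the exact variable names run_swelancer.py expects should be checked against sample.env.

```python
# Minimal pre-flight check before `uv run python run_swelancer.py`.
# OPENAI_API_KEY is assumed from the stated prerequisites; verify the
# exact variable names the harness reads against sample.env.
import os
import sys

required = ["OPENAI_API_KEY"]  # extend with the username variable from your .env
missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("Environment looks ready: uv run python run_swelancer.py")
```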

Highlighted Details

  • Benchmark dataset for evaluating LLMs in freelance software engineering.
  • Supports custom compute interfaces for flexible integration with diverse infrastructure.
  • Includes utilities like download_videos.py for models supporting video input.
  • Offers a "SWE-Lancer-Lite" version with a smaller dataset for quicker evaluations.

Maintenance & Community

The repository is maintained by OpenAI, which provides contact information for questions and contributions. Updates to the tasks, scaffolding, and codebase are ongoing.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The open-source agent and harness may differ from the internal scaffold used to produce the paper's performance metrics. Scaling SWE-Lancer requires implementing a custom ComputerInterface and modifying the _start_computer function; a hedged sketch of such an adapter follows.
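
The sketch below illustrates the general shape such an adapter might take. Only the names ComputerInterface and _start_computer come from the repository's notes; the class layout, method names, and signatures shown here are assumptions to be checked against the actual nanoeval/alcatraz interfaces.

```python
# Hypothetical sketch of a custom compute backend for scaling SWE-Lancer.
# ComputerInterface and _start_computer are named by the repo; everything
# else (method names, signatures) is an assumption, not the real API.

class RemoteVMInterface:
    """Stand-in for a ComputerInterface subclass that targets a remote VM
    or cloud sandbox instead of the default local Docker container."""

    def __init__(self, host: str) -> None:
        self.host = host

    async def run(self, cmd: str) -> tuple[int, bytes]:
        # Assumed contract: execute `cmd` inside the task environment and
        # return (exit_code, combined_output). Replace the body with your
        # backend's real execution call (SSH, cloud exec API, etc.).
        raise NotImplementedError("wire this to your infrastructure")

# _start_computer in run_swelancer.py would then be modified to construct
# and return this interface instead of the local Docker-backed default.
```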

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 11
  • Issues (30d): 27

Star History

88 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley).

SWE-Gym by SWE-Gym

  • Environment for training software engineering agents
  • 513 stars · top 1.0% · created 9 months ago · updated 4 days ago

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (founder of OpenBB), and 10 more.

JARVIS by microsoft

  • System for LLM-orchestrated AI task automation
  • 24k stars · top 0.1% · created 2 years ago · updated 4 days ago