SWELancer-Benchmark by openai

Benchmark for evaluating LLMs in real-world software engineering tasks

Created 7 months ago
1,436 stars

Top 28.4% on SourcePulse

View on GitHub
Project Summary

This repository provides the dataset and code for the SWE-Lancer benchmark, designed to evaluate the capabilities of frontier Large Language Models (LLMs) in real-world freelance software engineering tasks. It targets researchers and developers interested in assessing LLM performance in complex, multi-turn, and interactive coding scenarios, offering a standardized framework for reproducible evaluation.

How It Works

SWE-Lancer simulates freelance software engineering projects by presenting LLMs with tasks that require code generation, debugging, and interaction with a simulated development environment. The system utilizes a modular architecture, allowing for custom compute interfaces to integrate with various execution backends. This approach enables flexible deployment and scalability for large-scale benchmarking.

Quick Start & Requirements

  • Installation: Use uv sync to create the environment, activate it with source .venv/bin/activate, then install the project packages: uv pip install -e project/"nanoeval", uv pip install -e project/"alcatraz", and uv pip install -e project/"nanoeval_alcatraz". Alternatively, create a virtual environment with python -m venv .venv, activate it, run pip install -r requirements.txt, and install the same project packages with pip install -e. The uv-based sequence is consolidated in the snippet after this list.
  • Prerequisites: Python 3.11, Docker, an OpenAI API key, and a username.
  • Setup: Follow the Docker build instructions for Apple Silicon or Intel-based systems. Configure environment variables by copying sample.env to .env.
  • Running: Execute uv run python run_swelancer.py.
  • Documentation: SWE-Lancer Paper
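
For convenience, the uv-based setup and run commands summarized above can be chained as follows. Paths and script names are taken from this summary, so defer to the repository README if they have changed.

```bash
# Create the environment and install the project packages (uv-based path).
uv sync
source .venv/bin/activate
uv pip install -e project/"nanoeval"
uv pip install -e project/"alcatraz"
uv pip install -e project/"nanoeval_alcatraz"

# Configure environment variables (OpenAI API key, username, etc.).
cp sample.env .env

# Run the benchmark.
uv run python run_swelancer.py
```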

Highlighted Details

  • Benchmark dataset for evaluating LLMs in freelance software engineering.
  • Supports custom compute interfaces for flexible integration with diverse infrastructure.
  • Includes utilities like download_videos.py for models supporting video input.
  • Offers a "SWE-Lancer-Lite" version with a smaller dataset for quicker evaluations.

Maintenance & Community

The repository is maintained by OpenAI, with contact information provided for questions and contributions. Updates to the tasks, scaffolding, and codebase are ongoing.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The open-source agent and harness may differ from the internal scaffold used for paper performance metrics. Scaling SWE-Lancer requires implementing a custom ComputerInterface and modifying the _start_computer function.
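
As orientation for that extension point, a minimal sketch follows, assuming a backend of your own. The class and method names here (RemoteComputerInterface, CommandResult, run, upload) are hypothetical placeholders rather than the actual nanoeval/alcatraz API; consult the repository for the real ComputerInterface contract and the _start_computer function the harness expects you to modify.

```python
# Illustrative-only sketch of plugging SWE-Lancer into custom infrastructure.
# All names below are hypothetical placeholders, NOT the actual
# nanoeval/alcatraz API; see the repository for the real interfaces.
from dataclasses import dataclass


@dataclass
class CommandResult:
    exit_code: int
    output: bytes


class RemoteComputerInterface:
    """Hypothetical adapter that forwards task commands to infrastructure you manage."""

    def __init__(self, host: str) -> None:
        self.host = host  # e.g. a VM, container, or sandbox endpoint

    def run(self, command: str, timeout_s: int = 600) -> CommandResult:
        # Replace with an SSH call, a container exec, or an internal job API.
        raise NotImplementedError

    def upload(self, local_path: str, remote_path: str) -> None:
        # Copy task assets (repository checkout, tests) onto the remote machine.
        raise NotImplementedError


def _start_computer(task_id: str) -> RemoteComputerInterface:
    # In the real harness, _start_computer is the place to return your own
    # ComputerInterface implementation instead of the default Docker-backed one.
    return RemoteComputerInterface(host=f"worker-{task_id}.internal")
```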

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

lighteval by huggingface
  • LLM evaluation toolkit for multiple backends
  • Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.
  • 2.6% | 2k stars | Created 1 year ago | Updated 1 day ago

SWE-bench by SWE-bench
  • Benchmark for evaluating LLMs on real-world GitHub issues
  • Starred by Pawel Garbacki (Cofounder of Fireworks AI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.
  • 2.3% | 4k stars | Created 1 year ago | Updated 21 hours ago