openai: Benchmark for evaluating LLMs in real-world software engineering tasks
This repository provides the dataset and code for the SWE-Lancer benchmark, designed to evaluate the capabilities of frontier Large Language Models (LLMs) in real-world freelance software engineering tasks. It targets researchers and developers interested in assessing LLM performance in complex, multi-turn, and interactive coding scenarios, offering a standardized framework for reproducible evaluation.
How It Works
SWE-Lancer simulates freelance software engineering projects by presenting LLMs with tasks that require code generation, debugging, and interaction with a simulated development environment. The system utilizes a modular architecture, allowing for custom compute interfaces to integrate with various execution backends. This approach enables flexible deployment and scalability for large-scale benchmarking.
Quick Start & Requirements
Run uv sync for package management and activate the virtual environment (source .venv/bin/activate), then install the project packages: uv pip install -e project/"nanoeval", uv pip install -e project/"alcatraz", uv pip install -e project/"nanoeval_alcatraz". Alternatively, use python -m venv .venv, source .venv/bin/activate, pip install -r requirements.txt, and then install the project packages with pip install -e.
Copy sample.env to .env.
Run the benchmark with uv run python run_swelancer.py.
Highlighted Details
download_videos.py for models supporting video input.
Maintenance & Community
The repository is maintained by OpenAI, with contact information provided for questions and contributions. Updates to tasks, scaffolding, and codebase are ongoing.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The open-source agent and harness may differ from the internal scaffold used for paper performance metrics. Scaling SWE-Lancer requires implementing a custom ComputerInterface and modifying the _start_computer function.
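The shape of that extension point can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the run and stop methods, the CommandResult type, the LocalSubprocessComputer backend, and the start_computer factory are hypothetical stand-ins and do not reproduce the repository's actual ComputerInterface or _start_computer signatures. The sketch only shows the general pattern of hiding an execution backend behind an interface and choosing it in a single factory function.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CommandResult:
    """Outcome of one command run on a backend (hypothetical type)."""
    exit_code: int
    output: str


class ComputerInterface(Protocol):
    """Hypothetical stand-in for the interface named above; method names are assumptions."""

    def run(self, argv: list[str]) -> CommandResult: ...

    def stop(self) -> None: ...


class LocalSubprocessComputer:
    """Toy backend that runs commands on the local machine via subprocess."""

    def __init__(self) -> None:
        self.stopped = False

    def run(self, argv: list[str]) -> CommandResult:
        if self.stopped:
            raise RuntimeError("computer already stopped")
        proc = subprocess.run(argv, capture_output=True, text=True)
        return CommandResult(proc.returncode, proc.stdout + proc.stderr)

    def stop(self) -> None:
        self.stopped = True


def start_computer(backend: str = "local") -> ComputerInterface:
    """Hypothetical factory playing the role of _start_computer: picks a backend by name."""
    if backend == "local":
        return LocalSubprocessComputer()
    raise ValueError(f"unknown backend: {backend}")


if __name__ == "__main__":
    computer = start_computer()
    result = computer.run([sys.executable, "-c", "print('hello from the sandbox')"])
    print(result.exit_code, result.output.strip())
    computer.stop()
```

Under this pattern, scaling to another backend (a container service or a remote cluster, for example) would mean adding a class with the same two methods and one more branch in the factory, leaving the evaluation loop unchanged.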