experiments by SWE-bench

Open-sourced AI model performance data for code generation benchmarks

Created 1 year ago
254 stars

Project Summary

This repository provides open-sourced predictions, execution logs, and reasoning traces from model inference and evaluation runs on the SWE-bench task. It targets researchers and developers working on code generation, offering a transparent and reproducible platform for benchmark submissions and analysis. The project aims to advance the scientific understanding of code generation models.

How It Works

The project organizes experimental data into evaluation/ and validation/ directories, categorizing submissions by SWE-bench splits (e.g., lite, verified, multimodal). Each submission is stored in a dedicated subfolder containing model predictions, metadata, execution logs, and detailed reasoning traces. This structure is designed to facilitate community review, reproducibility, and transparency of AI-driven code generation experiments.
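As a minimal sketch of working with that layout, the snippet below walks the evaluation/ tree and maps each SWE-bench split to its submission folders. The function name and the root-folder argument are hypothetical, not part of the repository; only the evaluation/<split>/<submission> nesting is assumed from the description above.

```python
from pathlib import Path

def list_submissions(root: str = "experiments") -> dict[str, list[str]]:
    """Map each split (e.g. lite, verified, multimodal) to its submission folders.

    Assumes the layout described above: evaluation/<split>/<submission>/.
    """
    submissions: dict[str, list[str]] = {}
    eval_dir = Path(root) / "evaluation"
    for split_dir in sorted(eval_dir.iterdir()):
        if split_dir.is_dir():
            submissions[split_dir.name] = sorted(
                p.name for p in split_dir.iterdir() if p.is_dir()
            )
    return submissions
```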

Quick Start & Requirements

  • Log/Trajectory Download: Use python -m analysis.download_logs evaluation/<split>/<date + model> (requires AWS account and configured AWS CLI).
  • Evaluation: Refer to the main SWE-bench repository for instructions on using the sb-cli tool or running evaluations locally.
  • Submission Assets: Leaderboard participation requires all_preds.jsonl or preds.json, metadata.yaml, README.md, trajs/ (reasoning traces), and logs/.
  • Prerequisites: AWS account and AWS CLI for data download.
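A hypothetical pre-submission check, based only on the asset list above (the helper itself is not part of the repository), could verify that a submission folder carries everything the leaderboard requires:

```python
from pathlib import Path

def check_submission(folder: str) -> list[str]:
    """Return the missing required assets for a submission folder; empty means complete."""
    root = Path(folder)
    missing: list[str] = []
    # Predictions may be named either all_preds.jsonl or preds.json.
    if not ((root / "all_preds.jsonl").is_file() or (root / "preds.json").is_file()):
        missing.append("all_preds.jsonl or preds.json")
    for name in ("metadata.yaml", "README.md"):
        if not (root / name).is_file():
            missing.append(name)
    # trajs/ holds reasoning traces, logs/ holds execution logs.
    for name in ("trajs", "logs"):
        if not (root / name).is_dir():
            missing.append(name + "/")
    return missing
```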

Highlighted Details

  • Submission Policy Update: As of November 18, 2025, SWE-bench Verified and Multilingual leaderboards exclusively accept submissions from academic teams and research institutions with open-source methods and peer-reviewed publications, shifting focus to reproducible academic research.
  • Reasoning Trace Requirement: Submissions must now include human-readable reasoning traces generated during inference, detailing intermediate steps. This aims to provide insight into model behavior without mandating code releases.
  • Reproducibility Focus: The repository includes a validation/test_202404 split for re-running evaluations to ensure task instance reproducibility and hosts logs/trajectories on a public S3 bucket.

Maintenance & Community

For questions, create an issue in the repository or contact johnby@stanford.edu or carlosej@princeton.edu. No specific community channels (e.g., Slack, Discord) or roadmap links are detailed in this README snippet.

Licensing & Compatibility

The specific license for this experiments repository is not explicitly stated in the provided README. Users should consult the main SWE-bench repository for licensing information relevant to evaluation and submission.

Limitations & Caveats

  • Restricted Leaderboards: The Verified and Multilingual leaderboards are now exclusively for academic/research institutions, limiting participation for commercial entities.
  • Reasoning Trace Overhead: The mandatory inclusion of reasoning traces, while beneficial for transparency, adds an implementation and submission requirement.
  • AWS Dependency: Downloading experimental data necessitates an AWS account and CLI configuration.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 3
  • Star History: 8 stars in the last 30 days
