experiments by SWE-bench

Open-sourced AI model performance data for code generation benchmarks

Created 1 year ago
254 stars

Project Summary

This repository provides open-sourced predictions, execution logs, and reasoning traces from model inference and evaluation runs on the SWE-bench task. It targets researchers and developers working on code generation, offering a transparent and reproducible platform for benchmark submissions and analysis. The project aims to advance the scientific understanding of code generation models.

How It Works

The project organizes experimental data into evaluation/ and validation/ directories, categorizing submissions by SWE-bench splits (e.g., lite, verified, multimodal). Each submission is stored in a dedicated subfolder containing model predictions, metadata, execution logs, and detailed reasoning traces. This structure is designed to facilitate community review, reproducibility, and transparency of AI-driven code generation experiments.
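As a minimal sketch of working with that layout, the snippet below walks the evaluation/ tree and maps each SWE-bench split to its submission folders. The function name and the root-folder argument are hypothetical, not part of the repository; only the evaluation/<split>/<submission> nesting is assumed from the description above.

```python
from pathlib import Path

def list_submissions(root: str = "experiments") -> dict[str, list[str]]:
    """Map each split (e.g. lite, verified, multimodal) to its submission folders.

    Assumes the layout described above: evaluation/<split>/<submission>/.
    """
    submissions: dict[str, list[str]] = {}
    eval_dir = Path(root) / "evaluation"
    for split_dir in sorted(eval_dir.iterdir()):
        if split_dir.is_dir():
            submissions[split_dir.name] = sorted(
                p.name for p in split_dir.iterdir() if p.is_dir()
            )
    return submissions
```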

Quick Start & Requirements

  • Log/Trajectory Download: Use python -m analysis.download_logs evaluation/<split>/<date + model> (requires AWS account and configured AWS CLI).
  • Evaluation: Refer to the main SWE-bench repository for instructions on using the sb-cli tool or running evaluations locally.
  • Submission Assets: Leaderboard participation requires all_preds.jsonl or preds.json, metadata.yaml, README.md, trajs/ (reasoning traces), and logs/.
  • Prerequisites: AWS account and AWS CLI for data download.
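A hypothetical pre-submission check, based only on the asset list above (the helper itself is not part of the repository), could verify that a submission folder carries everything the leaderboard requires:

```python
from pathlib import Path

def check_submission(folder: str) -> list[str]:
    """Return the missing required assets for a submission folder; empty means complete."""
    root = Path(folder)
    missing: list[str] = []
    # Predictions may be named either all_preds.jsonl or preds.json.
    if not ((root / "all_preds.jsonl").is_file() or (root / "preds.json").is_file()):
        missing.append("all_preds.jsonl or preds.json")
    for name in ("metadata.yaml", "README.md"):
        if not (root / name).is_file():
            missing.append(name)
    # trajs/ holds reasoning traces, logs/ holds execution logs.
    for name in ("trajs", "logs"):
        if not (root / name).is_dir():
            missing.append(name + "/")
    return missing
```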

Highlighted Details

  • Submission Policy Update: As of November 18, 2025, SWE-bench Verified and Multilingual leaderboards exclusively accept submissions from academic teams and research institutions with open-source methods and peer-reviewed publications, shifting focus to reproducible academic research.
  • Reasoning Trace Requirement: Submissions must now include human-readable reasoning traces generated during inference, detailing intermediate steps. This aims to provide insight into model behavior without mandating code releases.
  • Reproducibility Focus: The repository includes a validation/test_202404 split for re-running evaluations to ensure task instance reproducibility and hosts logs/trajectories on a public S3 bucket.

Maintenance & Community

For questions, create an issue in the repository or contact johnby@stanford.edu or carlosej@princeton.edu. No specific community channels (e.g., Slack, Discord) or roadmap links are detailed in this README snippet.

Licensing & Compatibility

The specific license for this experiments repository is not explicitly stated in the provided README. Users should consult the main SWE-bench repository for licensing information relevant to evaluation and submission.

Limitations & Caveats

  • Restricted Leaderboards: The Verified and Multilingual leaderboards are now exclusively for academic/research institutions, limiting participation for commercial entities.
  • Reasoning Trace Overhead: The mandatory inclusion of reasoning traces, while beneficial for transparency, adds an implementation and submission requirement.
  • AWS Dependency: Downloading experimental data necessitates an AWS account and CLI configuration.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 3
  • Star History: 8 stars in the last 30 days
