SWE-bench
Open-sourced AI model performance data for code generation benchmarks
This repository provides open-sourced predictions, execution logs, and reasoning traces from model inference and evaluation runs on the SWE-bench task. It targets researchers and developers working on code generation, offering a transparent and reproducible platform for benchmark submissions and analysis. The project aims to advance the scientific understanding of code generation models.
How It Works
The project organizes experimental data into evaluation/ and validation/ directories, categorizing submissions by SWE-bench splits (e.g., lite, verified, multimodal). Each submission is stored in a dedicated subfolder containing model predictions, metadata, execution logs, and detailed reasoning traces. This structure is designed to facilitate community review, reproducibility, and transparency of AI-driven code generation experiments.
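The layout described above can be checked with a short script. The following is a minimal sketch, not part of the repository's tooling: it assumes submissions live at evaluation/<split>/<submission folder>/ and that each folder should hold the entries named under Quick Start & Requirements (all_preds.jsonl or preds.json, metadata.yaml, README.md, trajs/, logs/). The function names missing_entries and audit_split are hypothetical.

```python
from pathlib import Path

# Entry names are taken from the submission contents listed on this page.
# Either predictions file is assumed to be acceptable on its own.
PREDICTION_FILES = ("all_preds.jsonl", "preds.json")
REQUIRED_ENTRIES = ("metadata.yaml", "README.md", "trajs", "logs")


def missing_entries(submission_dir: Path) -> list[str]:
    """Return the expected entries that are absent from one submission folder."""
    missing = [
        name for name in REQUIRED_ENTRIES
        if not (submission_dir / name).exists()
    ]
    # At least one of the two predictions files must be present.
    if not any((submission_dir / name).is_file() for name in PREDICTION_FILES):
        missing.append(" or ".join(PREDICTION_FILES))
    return missing


def audit_split(repo_root: Path, split: str) -> dict[str, list[str]]:
    """Map each submission folder under evaluation/<split>/ to its missing entries."""
    split_dir = repo_root / "evaluation" / split
    if not split_dir.is_dir():
        return {}
    return {
        sub.name: missing_entries(sub)
        for sub in sorted(split_dir.iterdir())
        if sub.is_dir()
    }
```

Calling audit_split(Path("."), "lite") from a local checkout would return a dictionary keyed by submission folder name; an empty list means every expected entry was found.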
Quick Start & Requirements
Download logs: python -m analysis.download_logs evaluation/<split>/<date + model> (requires an AWS account and a configured AWS CLI).
Evaluation: use the sb-cli tool or run evaluations locally.
Submission contents: all_preds.jsonl or preds.json, metadata.yaml, README.md, trajs/ (reasoning traces), and logs/.
Highlighted Details
Includes a validation/test_202404 split for re-running evaluations to ensure task instance reproducibility, and hosts logs and trajectories on a public S3 bucket.
Maintenance & Community
For questions, create an issue in the repository or contact johnby@stanford.edu or carlosej@princeton.edu. No specific community channels (e.g., Slack, Discord) or roadmap links are detailed in this README snippet.
Licensing & Compatibility
The specific license for this experiments repository is not explicitly stated in the provided README. Users should consult the main SWE-bench repository for licensing information relevant to evaluation and submission.
Limitations & Caveats