web-bench by bytedance

LLM benchmark for real-world web development tasks

Created 9 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Web-Bench is a benchmark suite designed to evaluate Large Language Models (LLMs) on practical web development tasks. It targets AI researchers and developers building LLM agents for code generation, offering a challenging, real-world simulation of web project development to identify current model limitations.

How It Works

The benchmark comprises 50 complex web development projects, each broken down into 20 sequentially dependent tasks. These tasks mirror professional workflows, covering foundational web standards and frameworks. Projects are designed by experienced engineers and are substantial, requiring 4–8 hours for a senior human developer to complete, thus providing a rigorous testbed for LLM capabilities beyond simpler code generation tasks.

Quick Start & Requirements

  • Primary installation uses Docker. Users create a directory containing config.json5 (to specify models like openai/gpt-4o) and docker-compose.yml.
  • Prerequisites include Docker and API keys for desired models (e.g., OPENROUTER_API_KEY, ANTHROPIC_API_KEY).
  • Run command: docker compose up. Evaluation reports are generated under ./report/.
  • Relevant links: Install (assumed path), Paper, LeaderBoard.
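The setup above can be sketched as two files in the working directory. This is an illustrative mock-up, not the project's documented schema: only the model identifier (openai/gpt-4o) and the environment variable names (OPENROUTER_API_KEY, ANTHROPIC_API_KEY) come from the summary; the field names, service name, and image name are assumptions.

```json5
// config.json5 — hypothetical layout; only the model ID is from the summary
{
  models: [
    'openai/gpt-4o',  // model under evaluation, routed via OpenRouter
  ],
}
```

```yaml
# docker-compose.yml — hypothetical service and image names
services:
  web-bench:
    image: webbench/runner:latest        # placeholder image (assumption)
    environment:
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      - ./config.json5:/app/config.json5
      - ./report:/app/report             # evaluation reports land here
```

With these two files in place, running `docker compose up` in the same directory would produce evaluation reports under ./report/, per the steps above.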

Highlighted Details

  • Features 50 projects with 20 sequential tasks each, simulating professional development lifecycles.
  • Tasks cover core web standards and popular frameworks.
  • Projects are designed to be time-consuming (4–8 hours for senior engineers).
  • State-of-the-art performance on the benchmark is 25.1% Pass@1 (Claude 3.7 Sonnet), indicating significant challenges for current LLMs.

Maintenance & Community

  • Community engagement is facilitated via Discord and Lark (QR code provided in the README); specific invite URLs are not listed here.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The benchmark is highly challenging, with current SOTA LLMs achieving only 25.1% Pass@1, suggesting limited practical capability for complex, multi-step web development tasks.
  • Setup requires managing API keys and Docker configurations.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

  • Top 0.1% on SourcePulse · 1k stars
  • LLM benchmark for evaluating models on previously asked programming questions
  • Created 2 years ago · Updated 10 months ago