web-bench by bytedance

LLM benchmark for real-world web development tasks

Created 9 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

Web-Bench is a benchmark suite designed to evaluate Large Language Models (LLMs) on practical web development tasks. It targets AI researchers and developers building LLM agents for code generation, offering a challenging, real-world simulation of web project development to identify current model limitations.

How It Works

The benchmark comprises 50 complex web development projects, each broken down into 20 sequentially dependent tasks. These tasks mirror professional workflows, covering foundational web standards and frameworks. Projects are designed by experienced engineers and are substantial, requiring 4–8 hours for a senior human developer to complete, thus providing a rigorous testbed for LLM capabilities beyond simpler code generation tasks.

Quick Start & Requirements

  • Primary installation uses Docker. Users create a directory containing config.json5 (to specify models like openai/gpt-4o) and docker-compose.yml.
  • Prerequisites include Docker and API keys for desired models (e.g., OPENROUTER_API_KEY, ANTHROPIC_API_KEY).
  • Run command: docker compose up. Evaluation reports are generated under ./report/.
  • Relevant links: Install (assumed path), Paper, LeaderBoard.
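The setup above can be sketched as two files in the working directory. This is an illustrative mock-up, not the project's documented schema: only the model identifier (openai/gpt-4o) and the environment variable names (OPENROUTER_API_KEY, ANTHROPIC_API_KEY) come from the summary; the field names, service name, and image name are assumptions.

```json5
// config.json5 — hypothetical layout; only the model ID is from the summary
{
  models: [
    'openai/gpt-4o',  // model under evaluation, routed via OpenRouter
  ],
}
```

```yaml
# docker-compose.yml — hypothetical service and image names
services:
  web-bench:
    image: webbench/runner:latest        # placeholder image (assumption)
    environment:
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      - ./config.json5:/app/config.json5
      - ./report:/app/report             # evaluation reports land here
```

With these two files in place, running `docker compose up` in the same directory would produce evaluation reports under ./report/, per the steps above.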

Highlighted Details

  • Features 50 projects with 20 sequential tasks each, simulating professional development lifecycles.
  • Tasks cover core web standards and popular frameworks.
  • Projects are designed to be time-consuming (4–8 hours for senior engineers).
  • State-of-the-art performance on the benchmark is 25.1% Pass@1 (Claude 3.7 Sonnet), indicating significant challenges for current LLMs.

Maintenance & Community

  • Community engagement is facilitated via Discord and Lark (QR code provided in the README); specific invite URLs are not listed here.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The benchmark is highly challenging, with current SOTA LLMs achieving only 25.1% Pass@1, suggesting limited practical capability for complex, multi-step web development tasks.
  • Setup requires managing API keys and Docker configurations.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

  • Top 0.1% on SourcePulse · 1k stars
  • LLM benchmark for evaluating models on previously asked programming questions
  • Created 2 years ago · Updated 10 months ago