bytedance
LLM benchmark for real-world web development tasks
Top 98.5% on SourcePulse
Web-Bench is a benchmark suite designed to evaluate Large Language Models (LLMs) on practical web development tasks. It targets AI researchers and developers building LLM agents for code generation, offering a challenging, real-world simulation of web project development to identify current model limitations.
How It Works
The benchmark comprises 50 complex web development projects, each broken down into 20 sequentially dependent tasks. These tasks mirror professional workflows, covering foundational web standards and frameworks. Projects are designed by experienced engineers and are substantial, requiring 4–8 hours for a senior human developer to complete, thus providing a rigorous testbed for LLM capabilities beyond simpler code generation tasks.
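The sequential structure described above can be illustrated with a small sketch. This is not Web-Bench's actual harness: the names `run_project`, `attempt_task`, and `pass_rate` are hypothetical, and stopping a project at its first failed task is an assumption made here to model "sequentially dependent" tasks. Only the 50 x 20 layout comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, List

# Figures taken from the description: 50 projects, 20 tasks each.
PROJECTS = 50
TASKS_PER_PROJECT = 20
TOTAL_TASKS = PROJECTS * TASKS_PER_PROJECT


@dataclass
class ProjectResult:
    """Outcome of one project run (hypothetical structure)."""
    name: str
    passed_tasks: int
    total_tasks: int = TASKS_PER_PROJECT


def run_project(
    name: str,
    attempt_task: Callable[[int], bool],
    total_tasks: int = TASKS_PER_PROJECT,
) -> ProjectResult:
    """Attempt tasks 1..total_tasks in order.

    Assumption: because each task builds on the previous one,
    the run stops at the first task that fails.
    """
    passed = 0
    for task_index in range(1, total_tasks + 1):
        if not attempt_task(task_index):
            break
        passed += 1
    return ProjectResult(name, passed, total_tasks)


def pass_rate(results: List[ProjectResult]) -> float:
    """Fraction of all tasks passed across the given projects."""
    total = sum(r.total_tasks for r in results)
    if total == 0:
        return 0.0
    return sum(r.passed_tasks for r in results) / total


if __name__ == "__main__":
    # Stand-in for a model that succeeds on the first five tasks only.
    demo = run_project("demo-project", lambda i: i <= 5)
    print(demo.passed_tasks, "of", demo.total_tasks, "tasks passed")
    print("total tasks in the suite:", TOTAL_TASKS)
```

Under this assumption an early failure caps the score for the whole project, which is one way to see why chained tasks are harder than isolated code-generation prompts.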
Quick Start & Requirements
Configuration: config.json5 (to specify models like openai/gpt-4o) and docker-compose.yml.
API keys: environment variables (OPENROUTER_API_KEY, ANTHROPIC_API_KEY).
Run: docker compose up. Evaluation reports are generated under ./report/.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
2 months ago
Inactive