ArtifactsBenchmark by Tencent-Hunyuan

Benchmark for evaluating LLM-generated visual and interactive code

Created 6 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

Summary

ArtifactsBench tackles the critical evaluation gap for LLMs generating dynamic, interactive visual artifacts, moving beyond traditional code correctness benchmarks. It introduces a novel, automated, multimodal evaluation paradigm to assess visual fidelity and interactive integrity. This framework serves researchers and developers building user-centric generative models, providing a scalable, accurate tool to accelerate progress.

How It Works

The core is a multimodal pipeline: generated artifacts are programmatically rendered and their dynamic behavior is captured as screenshots and GIFs. An MLLM-as-Judge then scores these visual traces against fine-grained, per-task checklists. Unlike static code benchmarks, this yields fully automated yet reliable scoring, reaching 94.4% agreement with human judgment.
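Below is a minimal sketch of the render-and-capture step, assuming a generated HTML artifact already saved to disk; the viewport, wait time, and file names are assumptions, the repository's actual capture script may differ, and the MLLM judging call is only indicated in a comment.

# Hedged sketch: render an HTML artifact headlessly and capture one frame of its visual trace.
from pathlib import Path
from playwright.sync_api import sync_playwright

def capture_artifact(html_path: str, out_png: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(Path(html_path).resolve().as_uri())  # load the generated artifact
        page.wait_for_timeout(2000)                    # let animations and scripts settle
        page.screenshot(path=out_png, full_page=True)  # one frame of the dynamic behavior
        browser.close()

capture_artifact("artifact.html", "trace_0.png")
# The captured frames (or a GIF stitched from several of them), together with the
# task's fine-grained checklist, are what the MLLM judge (e.g., Gemini) scores.

Repeating the capture at several timestamps and stitching the frames into a GIF is one way to record interactive behavior for the judge.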

Quick Start & Requirements

Installation requires:

pip install vllm==0.8.3
pip install pytest-playwright
playwright install
playwright install-deps
pip install transformers
pip install requests
pip install tqdm

Evaluation requires API keys for judge models such as Gemini. The benchmark dataset ships at dataset/artifacts_bench.json, and the paper is linked from the repository's citation section.
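For orientation, a hedged sketch of inspecting the benchmark file follows; the field names used are assumptions rather than the documented schema, so check the JSON before relying on them.

import json

# Load the benchmark tasks shipped with the repository (assumed to be a JSON list).
with open("dataset/artifacts_bench.json", encoding="utf-8") as f:
    tasks = json.load(f)

print(f"{len(tasks)} tasks loaded")  # the paper reports 1,825 tasks in total
for task in tasks[:3]:
    # "index" and "question" are assumed keys; inspect the file for the real schema.
    print(task.get("index"), str(task.get("question", ""))[:80])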

Highlighted Details

  • Features 1,825 diverse tasks (web dev, SVG, games) stratified by difficulty.
  • Achieves 94.4% human preference agreement, validating its automated approach.
  • Reports leaderboard results: GPT-5 sets the record at 72.55, while GPT-OSS-120B leads open-source models at 57.69.
  • Releases 100% of the benchmark data as open source, with full reproducibility of the paper's results.

Maintenance & Community

Led by Tencent Hunyuan Team with contributions from Tencent, NJU, and PKU. No specific community channels (Discord/Slack) are listed, but an issue contact email (adamwzhang@tencent.com) is provided.

Licensing & Compatibility

The repository ships a LICENSE file, but the specific license type and its implications for commercial use or closed-source linking are not detailed in the README.

Limitations & Caveats

The README does not explicitly mention known bugs or an alpha status. Evaluation depends on proprietary LLM APIs (e.g., Gemini) for judging, which may limit adoption, and the specific license terms are not disclosed in the README.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Pawel Garbacki (cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench (0.8% on SourcePulse, 4k stars)
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago; updated 1 week ago