ArtifactsBenchmark by Tencent-Hunyuan

Benchmark for evaluating LLM-generated visual and interactive code

Created 6 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

Summary

ArtifactsBench tackles the critical evaluation gap for LLMs generating dynamic, interactive visual artifacts, moving beyond traditional code correctness benchmarks. It introduces a novel, automated, multimodal evaluation paradigm to assess visual fidelity and interactive integrity. This framework serves researchers and developers building user-centric generative models, providing a scalable, accurate tool to accelerate progress.

How It Works

The core is a multimodal pipeline: generated artifacts are programmatically rendered and their dynamic behavior is captured as screenshots and GIFs. An MLLM-as-Judge then scores these visual traces against fine-grained, per-task checklists. Unlike static code benchmarks, this yields fully automated yet reliable scoring, reaching 94.4% agreement with human judgment.
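Below is a minimal sketch of the render-and-capture step, assuming a generated HTML artifact already saved to disk; the viewport, wait time, and file names are assumptions, the repository's actual capture script may differ, and the MLLM judging call is only indicated in a comment.

# Hedged sketch: render an HTML artifact headlessly and capture one frame of its visual trace.
from pathlib import Path
from playwright.sync_api import sync_playwright

def capture_artifact(html_path: str, out_png: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(Path(html_path).resolve().as_uri())  # load the generated artifact
        page.wait_for_timeout(2000)                    # let animations and scripts settle
        page.screenshot(path=out_png, full_page=True)  # one frame of the dynamic behavior
        browser.close()

capture_artifact("artifact.html", "trace_0.png")
# The captured frames (or a GIF stitched from several of them), together with the
# task's fine-grained checklist, are what the MLLM judge (e.g., Gemini) scores.

Repeating the capture at several timestamps and stitching the frames into a GIF is one way to record interactive behavior for the judge.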

Quick Start & Requirements

Installation requires:

pip install vllm==0.8.3
pip install pytest-playwright
playwright install
playwright install-deps
pip install transformers
pip install requests
pip install tqdm

Evaluation requires API keys for judge models such as Gemini. The benchmark dataset ships at dataset/artifacts_bench.json, and the paper is linked from the repository's citation section.
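For orientation, a hedged sketch of inspecting the benchmark file follows; the field names used are assumptions rather than the documented schema, so check the JSON before relying on them.

import json

# Load the benchmark tasks shipped with the repository (assumed to be a JSON list).
with open("dataset/artifacts_bench.json", encoding="utf-8") as f:
    tasks = json.load(f)

print(f"{len(tasks)} tasks loaded")  # the paper reports 1,825 tasks in total
for task in tasks[:3]:
    # "index" and "question" are assumed keys; inspect the file for the real schema.
    print(task.get("index"), str(task.get("question", ""))[:80])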

Highlighted Details

  • Features 1,825 diverse tasks (web dev, SVG, games) stratified by difficulty.
  • Achieves 94.4% human preference agreement, validating its automated approach.
  • Reports leaderboard results: GPT-5 sets the record at 72.55, while GPT-OSS-120B leads open-source models at 57.69.
  • Releases 100% of the benchmark data as open source, with full reproducibility of the paper's results.

Maintenance & Community

Led by Tencent Hunyuan Team with contributions from Tencent, NJU, and PKU. No specific community channels (Discord/Slack) are listed, but an issue contact email (adamwzhang@tencent.com) is provided.

Licensing & Compatibility

The repository ships a LICENSE file, but the specific license type and its implications for commercial use or closed-source linking are not detailed in the README.

Limitations & Caveats

The README does not explicitly mention known bugs or an alpha status. Evaluation depends on proprietary LLM APIs (e.g., Gemini) for judging, which may limit adoption, and the specific license terms are not disclosed in the README.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; research engineer at Mistral), Pawel Garbacki (cofounder of Fireworks AI), and 15 more.

SWE-bench by SWE-bench (0.8% on SourcePulse, 4k stars)
Benchmark for evaluating LLMs on real-world GitHub issues
Created 2 years ago; updated 1 week ago