Tencent-Hunyuan ArtifactsBench: benchmark for evaluating LLM-generated visual and interactive code
Top 99.1% on SourcePulse
Summary
ArtifactsBench addresses a critical evaluation gap for LLMs that generate dynamic, interactive visual artifacts, moving beyond traditional code-correctness benchmarks. It introduces an automated, multimodal evaluation paradigm that assesses both visual fidelity and interactive integrity. The framework gives researchers and developers building user-centric generative models a scalable, accurate evaluation tool.
How It Works
The core is a multimodal pipeline: generated artifacts are programmatically rendered, and their dynamic behavior is captured as screenshots and GIFs. An MLLM-as-Judge then scores these visual traces against fine-grained checklists. Unlike static code benchmarks, this enables fully automated yet reliable scoring, reaching 94.4% agreement with human judgment.
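The checklist-scoring step can be sketched as follows. This is a minimal illustration of aggregating per-criterion judge scores into one artifact score; the class, function, and criterion names are assumptions for illustration, not the benchmark's actual API:

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One fine-grained criterion the judge scores (hypothetical structure)."""
    description: str
    max_score: int


def aggregate_scores(items: list[ChecklistItem], judge_scores: list[int]) -> float:
    """Combine per-item judge scores into a normalized 0-100 artifact score."""
    if len(items) != len(judge_scores):
        raise ValueError("one judge score is expected per checklist item")
    earned = sum(min(max(s, 0), item.max_score)  # clamp each score to its valid range
                 for item, s in zip(items, judge_scores))
    total = sum(item.max_score for item in items)
    return 100.0 * earned / total


# Hypothetical checklist for one rendered interactive widget:
checklist = [
    ChecklistItem("Chart renders without visual glitches", 10),
    ChecklistItem("Hover tooltip appears over data points", 10),
    ChecklistItem("Reset button restores the initial view", 5),
]
score = aggregate_scores(checklist, [10, 7, 5])  # scores an MLLM judge might return
print(round(score, 1))  # → 88.0
```

Clamping guards against a judge returning out-of-range values; normalizing to 0-100 makes scores comparable across tasks with different checklist lengths.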
Quick Start & Requirements
Installation requires:
pip install vllm==0.8.3
pip install pytest-playwright
playwright install
playwright install-deps
pip install transformers
pip install requests
pip install tqdm
Evaluation requires API keys for judge models such as Gemini. The benchmark dataset is located at dataset/artifacts_bench.json; a link to the paper is provided in the citation section.
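Consuming the benchmark file can be sketched as below. The GEMINI_API_KEY environment-variable name and the record fields ("index", "question") are assumptions for illustration, not the published schema:

```python
import json
import os


def load_tasks(path: str) -> list[dict]:
    """Load benchmark tasks from the repo's JSON dataset file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_stub(path: str = "dataset/artifacts_bench.json") -> None:
    """Iterate tasks after checking that a judge-model key is configured."""
    api_key = os.environ.get("GEMINI_API_KEY")  # assumed variable name for the judge key
    if api_key is None:
        raise RuntimeError("set GEMINI_API_KEY before running evaluation")
    for task in load_tasks(path):
        # "index" and "question" are assumed field names for illustration
        print(task.get("index"), str(task.get("question", ""))[:60])
```

Failing fast on a missing key avoids discovering the misconfiguration only after artifacts have been rendered.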
Highlighted Details
Maintenance & Community
Led by the Tencent Hunyuan team, with contributions from Tencent, NJU, and PKU. No community channels (Discord/Slack) are listed, but a contact email for issues (adamwzhang@tencent.com) is provided.
Licensing & Compatibility
The repository ships a LICENSE file, but the specific license type, and its implications for commercial use or closed-source linking, are not detailed in the README.
Limitations & Caveats
The README does not note known bugs or an alpha status. Evaluation requires access to a proprietary LLM API (e.g., Gemini), which may limit adoption.