LLM benchmark for creative story writing
This benchmark evaluates Large Language Models (LLMs) on their ability to creatively incorporate 10 mandatory story elements into short narratives. It targets researchers and developers working on creative LLM applications, providing a standardized method to compare model performance in constraint satisfaction and literary quality.
How It Works
Each evaluated model writes short stories (400-500 words) that must organically integrate 10 randomly assigned elements (character, object, concept, attribute, action, method, setting, timeframe, motivation, tone). Six LLM graders then score each story across 16 criteria covering element integration and literary qualities such as character development, plot coherence, and atmosphere. Using multiple graders is intended to make the evaluation more robust and nuanced.
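A minimal sketch of how the element constraints and multi-grader scoring described above could be wired together. The ten element categories come from the README, but the prompt wording, data shapes, and averaging step are illustrative assumptions, not the repository's actual implementation.

```python
from statistics import mean

# The ten element categories each story must integrate (from the README);
# the prompt wording below is illustrative, not the benchmark's actual text.
ELEMENT_CATEGORIES = [
    "character", "object", "concept", "attribute", "action",
    "method", "setting", "timeframe", "motivation", "tone",
]

def build_story_prompt(elements: dict[str, str]) -> str:
    """Assemble a writing prompt that requires all ten assigned elements."""
    required = "\n".join(f"- {cat}: {elements[cat]}" for cat in ELEMENT_CATEGORIES)
    return (
        "Write a short story of 400-500 words that organically integrates "
        "all of the following elements:\n" + required
    )

def aggregate_scores(grades: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each grading criterion across the grader models.

    `grades` maps grader name -> {criterion: score}; the 16 criteria names
    are whatever the graders were asked to score.
    """
    criteria = next(iter(grades.values())).keys()
    return {
        criterion: mean(grader[criterion] for grader in grades.values())
        for criterion in criteria
    }
```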
Quick Start & Requirements
This repository publishes benchmark results and analysis rather than a turnkey harness. Running the benchmark yourself requires substantial compute and API access to multiple LLMs. The README documents the methodology and presents extensive results tables and visualizations.
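For readers who want to reproduce a single story-plus-grading round, here is a rough sketch using the OpenAI Python client. The model names, rubric text, and environment variable are assumptions; the repository documents results and methodology rather than a runnable harness.

```python
import os
from openai import OpenAI  # pip install openai; any OpenAI-compatible endpoint works

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_story(prompt: str, model: str = "gpt-4o") -> str:
    """Ask one writer model for a 400-500 word story (model name is illustrative)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def grade_story(story: str, grader_model: str) -> str:
    """Ask one grader model to score the story; the rubric text is a placeholder."""
    rubric = "Score this story from 1 to 10 on each of the 16 criteria, one per line."
    response = client.chat.completions.create(
        model=grader_model,
        messages=[{"role": "user", "content": f"{rubric}\n\n{story}"}],
    )
    return response.choices[0].message.content
```

In the actual benchmark this loop would run over many element sets, every writer model, and six grader models, so API cost grows quickly.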
Maintenance & Community
The project is maintained by Lech Mazur, with updates provided on X (@lechmazur). The README lists recent updates and additions of new LLMs to the benchmark.
Licensing & Compatibility
The README does not specify a license for the repository content; the project appears intended for research and benchmarking purposes.
Limitations & Caveats
Because stories are graded individually, the benchmark may miss repetitive creative patterns across a single model's outputs; future assessments aim to address originality and variety. The accuracy of LLM grading still requires human validation, although the high correlation among graders suggests the scores capture a real signal.
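As a rough illustration of the cross-story originality check the README flags as future work, a model's outputs could be compared pairwise for lexical overlap. The similarity measure and threshold below are arbitrary assumptions, not part of the benchmark.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two stories (crude proxy for overlap)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def flag_repetitive_pairs(stories: list[str], threshold: float = 0.5) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for story pairs whose overlap exceeds the threshold."""
    return [
        (i, j, s)
        for (i, a), (j, b) in combinations(enumerate(stories), 2)
        if (s := jaccard(a, b)) >= threshold
    ]
```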