writing by lechmazur

LLM benchmark for creative story writing

created 7 months ago
269 stars

Top 96.2% on sourcepulse

Project Summary

This benchmark evaluates Large Language Models (LLMs) on their ability to creatively incorporate 10 mandatory story elements into short narratives. It targets researchers and developers working on creative LLM applications, providing a standardized method to compare model performance in constraint satisfaction and literary quality.

How It Works

Each evaluated model writes a short story (400–500 words) that must organically integrate ten randomly assigned elements, one from each of ten categories (character, object, concept, attribute, action, method, setting, timeframe, motivation, tone). Six grader LLMs then score each story on 16 criteria covering element integration and literary qualities such as character development, plot coherence, and atmosphere. Using multiple graders is intended to make the evaluation more robust and nuanced.
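A minimal sketch of that pipeline shape is below. It is not the repository's actual code: the element pools, grader names, and function names are all made up for illustration; the real benchmark draws elements from its own lists and calls live LLM APIs.

```python
import random
from statistics import mean

# Hypothetical element pools, one per mandatory category; the real
# benchmark draws from its own (much larger) lists.
ELEMENT_POOLS = {
    "character":  ["a retired lighthouse keeper", "a courier"],
    "object":     ["a cracked compass", "a wax seal"],
    "concept":    ["borrowed time", "quiet defiance"],
    "attribute":  ["left-handed", "colorblind"],
    "action":     ["mends", "buries"],
    "method":     ["by candlelight", "in code"],
    "setting":    ["a flooded archive", "a night market"],
    "timeframe":  ["one tide cycle", "the last hour of a festival"],
    "motivation": ["to repay a debt", "to be believed"],
    "tone":       ["wistful", "dry humor"],
}

def sample_assignment(pools, rng=random):
    """Pick one required element per category for a single story prompt."""
    return {category: rng.choice(options) for category, options in pools.items()}

def aggregate_scores(grades):
    """Average a story's 16 criterion scores across the grader LLMs.

    `grades` maps grader name -> list of 16 scores for one story.
    """
    per_criterion = zip(*grades.values())  # regroup scores by criterion
    return [mean(criterion_scores) for criterion_scores in per_criterion]

if __name__ == "__main__":
    print(sample_assignment(ELEMENT_POOLS))
    demo = {"grader_a": [7, 8] + [7] * 14, "grader_b": [6, 9] + [6] * 14}
    print(aggregate_scores(demo))
```

Averaging per criterion rather than per grader preserves the 16-dimensional score profile, which is what a criterion-level results table needs.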

Quick Start & Requirements

This repository provides benchmark results and analysis. Running the benchmark itself requires significant computational resources and access to multiple LLM APIs. The README details the methodology and presents extensive results tables and visualizations.

Highlighted Details

  • Evaluates 27 LLMs using updated grading models including GPT-4o Mar 2025, Claude 3.7 Sonnet, and Llama 4 Maverick.
  • Measures both constraint satisfaction (element incorporation) and literary quality (cohesion, creativity).
  • Includes detailed per-question summaries and grader-LLM correlation analysis (a small illustrative sketch follows this list).
  • Provides examples of top and bottom-ranked individual stories with their required elements.
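The grader-correlation analysis mentioned above can be illustrated with a short sketch. This is not the repository's code: the grader names and scores are invented, and the actual analysis may use a different correlation measure (e.g., Spearman rather than Pearson).

```python
from itertools import combinations
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def grader_correlations(scores_by_grader):
    """Pairwise correlation of per-story scores between graders.

    `scores_by_grader` maps grader name -> per-story scores, aligned so
    that index i refers to the same story for every grader.
    """
    return {
        (a, b): pearson(scores_by_grader[a], scores_by_grader[b])
        for a, b in combinations(scores_by_grader, 2)
    }

if __name__ == "__main__":
    scores = {
        "grader_a": [8.1, 6.4, 7.7, 5.9],
        "grader_b": [7.9, 6.0, 7.5, 6.1],
        "grader_c": [8.4, 6.8, 7.2, 5.5],
    }
    print(grader_correlations(scores))
```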

Maintenance & Community

The project is maintained by Lech Mazur, with updates provided on X (@lechmazur). The README lists recent updates and additions of new LLMs to the benchmark.

Licensing & Compatibility

The repository content is not explicitly licensed in the README. The project appears to be for research and benchmarking purposes.

Limitations & Caveats

Because stories are graded individually, the benchmark can miss repetitive creative patterns across a single model's outputs; future assessments aim to address originality and variety (one simple way such a check could work is sketched below). LLM grading accuracy is noted as still requiring human validation, though the high inter-grader correlations suggest the benchmark measures a real signal.
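The sketch below is not part of the benchmark; it is a hypothetical illustration, under the assumption that cross-story phrase reuse is a reasonable proxy, of how repetition across one model's stories could be quantified.

```python
from itertools import combinations

def ngrams(text, n=3):
    """Set of lowercased word n-grams in one story."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cross_story_repetition(stories, n=3):
    """Mean pairwise n-gram overlap across one model's stories.

    A high value hints that the model reuses phrasing even when each
    story, graded in isolation, looks fine. Needs at least two stories.
    """
    grams = [ngrams(s, n) for s in stories]
    pairs = list(combinations(grams, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```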

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 74 stars in the last 90 days
