writing by lechmazur

LLM benchmark for creative story writing

created 7 months ago
269 stars

Top 96.2% on sourcepulse

Project Summary

This benchmark evaluates Large Language Models (LLMs) on their ability to creatively incorporate 10 mandatory story elements into short narratives. It targets researchers and developers working on creative LLM applications, providing a standardized method to compare model performance in constraint satisfaction and literary quality.

How It Works

Each evaluated model writes a short story (400–500 words) that must organically integrate ten randomly assigned elements, one from each of ten categories (character, object, concept, attribute, action, method, setting, timeframe, motivation, tone). Six grader LLMs then score each story on 16 criteria covering element integration and literary qualities such as character development, plot coherence, and atmosphere. Using multiple graders is intended to make the evaluation more robust and nuanced.
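A minimal sketch of that pipeline shape is below. It is not the repository's actual code: the element pools, grader names, and function names are all made up for illustration; the real benchmark draws elements from its own lists and calls live LLM APIs.

```python
import random
from statistics import mean

# Hypothetical element pools, one per mandatory category; the real
# benchmark draws from its own (much larger) lists.
ELEMENT_POOLS = {
    "character":  ["a retired lighthouse keeper", "a courier"],
    "object":     ["a cracked compass", "a wax seal"],
    "concept":    ["borrowed time", "quiet defiance"],
    "attribute":  ["left-handed", "colorblind"],
    "action":     ["mends", "buries"],
    "method":     ["by candlelight", "in code"],
    "setting":    ["a flooded archive", "a night market"],
    "timeframe":  ["one tide cycle", "the last hour of a festival"],
    "motivation": ["to repay a debt", "to be believed"],
    "tone":       ["wistful", "dry humor"],
}

def sample_assignment(pools, rng=random):
    """Pick one required element per category for a single story prompt."""
    return {category: rng.choice(options) for category, options in pools.items()}

def aggregate_scores(grades):
    """Average a story's 16 criterion scores across the grader LLMs.

    `grades` maps grader name -> list of 16 scores for one story.
    """
    per_criterion = zip(*grades.values())  # regroup scores by criterion
    return [mean(criterion_scores) for criterion_scores in per_criterion]

if __name__ == "__main__":
    print(sample_assignment(ELEMENT_POOLS))
    demo = {"grader_a": [7, 8] + [7] * 14, "grader_b": [6, 9] + [6] * 14}
    print(aggregate_scores(demo))
```

Averaging per criterion rather than per grader preserves the 16-dimensional score profile, which is what a criterion-level results table needs.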

Quick Start & Requirements

This repository provides benchmark results and analysis. Running the benchmark itself requires significant computational resources and access to multiple LLM APIs. The README details the methodology and presents extensive results tables and visualizations.

Highlighted Details

  • Evaluates 27 LLMs using updated grading models including GPT-4o Mar 2025, Claude 3.7 Sonnet, and Llama 4 Maverick.
  • Measures both constraint satisfaction (element incorporation) and literary quality (cohesion, creativity).
  • Includes detailed per-question summaries and grader-LLM correlation analysis (a small illustrative sketch follows this list).
  • Provides examples of top and bottom-ranked individual stories with their required elements.
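The grader-correlation analysis mentioned above can be illustrated with a short sketch. This is not the repository's code: the grader names and scores are invented, and the actual analysis may use a different correlation measure (e.g., Spearman rather than Pearson).

```python
from itertools import combinations
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def grader_correlations(scores_by_grader):
    """Pairwise correlation of per-story scores between graders.

    `scores_by_grader` maps grader name -> per-story scores, aligned so
    that index i refers to the same story for every grader.
    """
    return {
        (a, b): pearson(scores_by_grader[a], scores_by_grader[b])
        for a, b in combinations(scores_by_grader, 2)
    }

if __name__ == "__main__":
    scores = {
        "grader_a": [8.1, 6.4, 7.7, 5.9],
        "grader_b": [7.9, 6.0, 7.5, 6.1],
        "grader_c": [8.4, 6.8, 7.2, 5.5],
    }
    print(grader_correlations(scores))
```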

Maintenance & Community

The project is maintained by Lech Mazur, with updates provided on X (@lechmazur). The README lists recent updates and additions of new LLMs to the benchmark.

Licensing & Compatibility

The repository content is not explicitly licensed in the README. The project appears to be for research and benchmarking purposes.

Limitations & Caveats

Because stories are graded individually, the benchmark can miss repetitive creative patterns across a single model's outputs; future assessments aim to address originality and variety (one simple way such a check could work is sketched below). LLM grading accuracy is noted as still requiring human validation, though the high inter-grader correlations suggest the benchmark measures a real signal.
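The sketch below is not part of the benchmark; it is a hypothetical illustration, under the assumption that cross-story phrase reuse is a reasonable proxy, of how repetition across one model's stories could be quantified.

```python
from itertools import combinations

def ngrams(text, n=3):
    """Set of lowercased word n-grams in one story."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cross_story_repetition(stories, n=3):
    """Mean pairwise n-gram overlap across one model's stories.

    A high value hints that the model reuses phrasing even when each
    story, graded in isolation, looks fine. Needs at least two stories.
    """
    grams = [ngrams(s, n) for s in stories]
    pairs = list(combinations(grams, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```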

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 74 stars in the last 90 days
