Research paper and dataset for LLM service behavior analysis over time
This repository addresses the opacity surrounding updates to large language models (LLMs) like GPT-4 and GPT-3.5 by providing datasets and historical generations. It enables researchers and users to track and understand behavioral shifts in LLM services over time, highlighting performance variations and potential degradations.
How It Works
The project assembles diverse evaluation datasets, prompts the LLM services with them, and archives the responses at different points in time. Comparing these snapshots enables quantitative analysis of behavioral drift, such as GPT-4's drop in accuracy on prime-number identification between its March 2023 and June 2023 versions. Because the generations are archived, the drift can be studied empirically without re-querying the APIs; a minimal collection sketch follows.
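As an illustration only, here is a minimal sketch of the collection step. It assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment; the snapshot names, prompt, and output file are illustrative and not taken from the repository.

```python
# Query dated model snapshots and record responses for later drift comparison.
import csv
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March vs. June 2023 snapshots (illustrative)
prompt = "Is 17077 a prime number? Think step by step and answer [Yes] or [No]."

with open("drift_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "query", "generated answer", "latency_s"])
    for model in SNAPSHOTS:
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        writer.writerow(
            [model, prompt, resp.choices[0].message.content, time.time() - start]
        )
```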
Quick Start & Requirements
Pre-computed generations are stored as CSV files under the generation/ directory; each CSV contains the model, query parameters, query, reference answer, generated answer, and latency. A loading sketch is shown below.
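A minimal loading sketch with pandas; the file names under generation/ and the exact CSV column names are assumptions based on the fields listed above.

```python
# Load two archived generation files and compare a simple accuracy proxy across dates.
import pandas as pd

march = pd.read_csv("generation/prime_gpt4_march2023.csv")  # hypothetical file name
june = pd.read_csv("generation/prime_gpt4_june2023.csv")    # hypothetical file name

def accuracy(df):
    # Count a generation as correct when it contains the reference answer
    # (an illustrative heuristic, not the paper's evaluation protocol).
    hits = df.apply(
        lambda r: str(r["reference answer"]).strip().lower()
        in str(r["generated answer"]).lower(),
        axis=1,
    )
    return hits.mean()

print("March accuracy:", accuracy(march))
print("June accuracy:", accuracy(june))
print("Mean latency (March vs. June):", march["latency"].mean(), june["latency"].mean())
```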
Highlighted Details
Maintenance & Community
The project is associated with researchers from Stanford University. Updates are logged in the changelog, with the initial release in July 2023.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Obtaining new generations requires an OpenAI API key, and the specific LLM versions tested are tied to the dates of generation. The README does not detail the underlying infrastructure or specific hardware used for the original generations.