Dataset of GPT-2 outputs for AI research
This dataset provides 250K GPT-2 outputs for each GPT-2 model size (117M, 345M, 762M, 1542M), covering both random sampling and Top-K (k=40) generation. It also includes samples from a model fine-tuned on Amazon reviews, enabling research into generated-text detection, bias analysis, and model behavior.
How It Works
The dataset comprises JSONL files containing text generated by various GPT-2 models. For each model, outputs are provided for a training split (250K examples) and validation/test splits (5K examples each). Two generation strategies are included: random sampling (temperature 1) and Top-K sampling with k=40, offering diverse output characteristics for analysis.
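As a sketch of how a split can be consumed, the snippet below reads one JSONL file into memory. The filename follows an assumed {model}.{split}.jsonl naming scheme, and the "text" field is an assumption about the per-record format; neither is guaranteed by this summary.

```python
import json

def load_jsonl(path):
    """Read one dataset split (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Hypothetical filename illustrating the assumed naming scheme.
samples = load_jsonl("xl-1542M-k40.train.jsonl")
print(len(samples))                       # expected ~250K for a train split
print(samples[0].get("text", "")[:200])   # "text" field is an assumption
```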
Quick Start & Requirements
Use the download_dataset.py script to fetch the data files from https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/.
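As an alternative to the script, a single file can be fetched directly from the public blob, as in the minimal sketch below. The filename is a hypothetical example of the naming scheme, and requests is a third-party dependency.

```python
import requests

BASE_URL = "https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/"
filename = "small-117M.valid.jsonl"  # hypothetical example; see download_dataset.py for the actual file list

# Stream the file to disk in 1 MiB chunks to avoid holding it all in memory.
resp = requests.get(BASE_URL + filename, stream=True)
resp.raise_for_status()
with open(filename, "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)
print(f"saved {filename}")
```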
Highlighted Details
- 250K training examples plus 5K validation and 5K test examples per model and sampling strategy.
- Four model sizes: 117M, 345M, 762M, and 1542M parameters.
- Two generation strategies: random sampling (temperature 1) and Top-K sampling (k=40).
- Additional samples from a model fine-tuned on Amazon reviews.
Maintenance & Community
Questions and feedback can be directed to webtextdata@openai.com. The repository was last updated about a year ago and is marked inactive.
Licensing & Compatibility
Limitations & Caveats
The dataset is a static snapshot of GPT-2 outputs and may not reflect current model capabilities or generation techniques. The README also notes that fine-tuning can evade detection, suggesting limits to the robustness of detection methods trained on this data.
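To make the detection use case concrete, here is a minimal baseline sketch: a bag-of-words classifier separating human-written text from model samples. The filenames are hypothetical, the "text" field is an assumption about the record format, and scikit-learn is a third-party dependency; real detectors, and the robustness caveat above, are far more involved.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def load_texts(path, limit=5000):
    """Pull the assumed "text" field from the first `limit` records of a JSONL split."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            texts.append(json.loads(line)["text"])
    return texts

# Hypothetical filenames: human-written WebText vs. GPT-2 XL Top-K samples.
human = load_texts("webtext.train.jsonl")
generated = load_texts("xl-1542M-k40.train.jsonl")

X = human + generated
y = [0] * len(human) + [1] * len(generated)  # 0 = human, 1 = generated
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# TF-IDF features with a simple linear classifier as a detection baseline.
vec = TfidfVectorizer(max_features=50000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_tr), y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(vec.transform(X_te))))
```

A baseline like this typically separates Top-K samples from human text more easily than pure random samples, which is one reason the dataset ships both strategies.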