gpt-2-output-dataset by openai

Dataset of GPT-2 outputs for AI research

created 6 years ago
1,991 stars

Top 22.6% on sourcepulse

1 Expert Loves This Project
Project Summary

This dataset provides 250K GPT-2 model outputs for each of the four GPT-2 model sizes (117M, 345M, 762M, and 1542M parameters), generated with both random sampling and Top-K (k=40) sampling. It also includes samples from a GPT-2 model fine-tuned on Amazon reviews, enabling research into generated-text detection, bias analysis, and model behavior.

How It Works

The dataset comprises JSONL files containing text generated by various GPT-2 models. For each model, outputs are provided for a training split (250K examples) and validation/test splits (5K examples each). Two generation strategies are included: random sampling (temperature 1) and Top-K sampling with k=40, offering diverse output characteristics for analysis.
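For context, the sketch below shows one way to load a split, assuming each line of a JSONL file is a JSON object with a `text` field and that the file name follows the naming used by the download script (the `xl-1542M-k40.train.jsonl` name here is illustrative):

```python
import json

def load_texts(path, limit=None):
    """Read generated texts from one of the dataset's JSONL splits."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            # Each record is assumed to expose the generated text under "text".
            texts.append(record["text"])
    return texts

# Illustrative file name; substitute whichever split you downloaded.
samples = load_texts("xl-1542M-k40.train.jsonl", limit=1000)
print(f"loaded {len(samples)} samples")
```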

Quick Start & Requirements

  • Download all data using the provided download_dataset.py script (a sketch for fetching a single file directly follows this list).
  • Data is hosted on Azure Blob Storage: https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/
  • Official documentation and download script: https://github.com/openai/gpt-2-output-dataset
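If you only need one split rather than the full download, here is a minimal sketch of fetching a single file from the Azure container; the file name is an assumption based on the download script's naming, so substitute the split you want:

```python
import requests

BASE_URL = "https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/"
FILENAME = "small-117M-k40.test.jsonl"  # assumed naming; pick the split you need

# Stream the file to disk so large splits do not have to fit in memory.
response = requests.get(BASE_URL + FILENAME, stream=True)
response.raise_for_status()
with open(FILENAME, "wb") as f:
    for chunk in response.iter_content(chunk_size=1 << 20):
        f.write(chunk)
print("saved", FILENAME)
```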

Highlighted Details

  • Includes outputs from GPT-2 models ranging from 117M to 1542M parameters.
  • Features a GPT-2 model fine-tuned on Amazon reviews for detection research.
  • Provides detectability baselines achieving mid-90s accuracy for Top-K 40 generations (an illustrative baseline sketch follows this list).
  • Notes that fine-tuning may evade detection methods.
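As an illustration of the kind of detectability baseline the repository reports, the sketch below trains a simple real-vs-generated classifier with TF-IDF features and logistic regression. It is not the repository's own baseline code; `real_texts` and `fake_texts` are assumed to be lists of human-written and GPT-2-generated strings loaded from the JSONL files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_detector(real_texts, fake_texts):
    """Fit a TF-IDF + logistic regression detector and report held-out accuracy."""
    texts = real_texts + fake_texts
    labels = [0] * len(real_texts) + [1] * len(fake_texts)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=2**18),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    return model
```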

Maintenance & Community

  • Maintained by OpenAI.
  • Data removal requests can be sent to webtextdata@openai.com.

Licensing & Compatibility

  • The README does not explicitly state a license. The data is distributed openly for research use, but consult the repository itself before assuming permissive or commercial-use terms.

Limitations & Caveats

The dataset is a snapshot of GPT-2 outputs and may not reflect the latest model capabilities or generation techniques. The README mentions that fine-tuning can evade detection, suggesting limitations in the robustness of detection methods applied to this data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days

Explore Similar Projects

Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai) and Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code).

GPT2 by ConnorJL

  • 1k stars
  • GPT-2 training implementation, supporting TPUs and GPUs
  • created 6 years ago, updated 2 years ago