gpt-2-output-dataset by openai

Dataset of GPT-2 outputs for AI research

created 6 years ago
1,991 stars

Top 22.6% on sourcepulse

1 Expert Loves This Project
Project Summary

This dataset provides 250K GPT-2 model outputs for each of the four GPT-2 model sizes (117M, 345M, 762M, and 1542M parameters), generated with both random sampling and Top-K (k=40) sampling. It also includes samples from a GPT-2 model fine-tuned on Amazon reviews, enabling research into generated-text detection, bias analysis, and model behavior.

How It Works

The dataset comprises JSONL files containing text generated by various GPT-2 models. For each model, outputs are provided for a training split (250K examples) and validation/test splits (5K examples each). Two generation strategies are included: random sampling (temperature 1) and Top-K sampling with k=40, offering diverse output characteristics for analysis.
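For context, the sketch below shows one way to load a split, assuming each line of a JSONL file is a JSON object with a `text` field and that the file name follows the naming used by the download script (the `xl-1542M-k40.train.jsonl` name here is illustrative):

```python
import json

def load_texts(path, limit=None):
    """Read generated texts from one of the dataset's JSONL splits."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            # Each record is assumed to expose the generated text under "text".
            texts.append(record["text"])
    return texts

# Illustrative file name; substitute whichever split you downloaded.
samples = load_texts("xl-1542M-k40.train.jsonl", limit=1000)
print(f"loaded {len(samples)} samples")
```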

Quick Start & Requirements

  • Download all data using the provided download_dataset.py script (a sketch for fetching a single file directly follows this list).
  • Data is hosted on Azure Blob Storage: https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/
  • Official documentation and download script: https://github.com/openai/gpt-2-output-dataset
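If you only need one split rather than the full download, here is a minimal sketch of fetching a single file from the Azure container; the file name is an assumption based on the download script's naming, so substitute the split you want:

```python
import requests

BASE_URL = "https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/"
FILENAME = "small-117M-k40.test.jsonl"  # assumed naming; pick the split you need

# Stream the file to disk so large splits do not have to fit in memory.
response = requests.get(BASE_URL + FILENAME, stream=True)
response.raise_for_status()
with open(FILENAME, "wb") as f:
    for chunk in response.iter_content(chunk_size=1 << 20):
        f.write(chunk)
print("saved", FILENAME)
```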

Highlighted Details

  • Includes outputs from GPT-2 models ranging from 117M to 1542M parameters.
  • Features a GPT-2 model fine-tuned on Amazon reviews for detection research.
  • Provides detectability baselines achieving mid-90s accuracy for Top-K 40 generations (an illustrative baseline sketch follows this list).
  • Notes that fine-tuning may evade detection methods.
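As an illustration of the kind of detectability baseline the repository reports, the sketch below trains a simple real-vs-generated classifier with TF-IDF features and logistic regression. It is not the repository's own baseline code; `real_texts` and `fake_texts` are assumed to be lists of human-written and GPT-2-generated strings loaded from the JSONL files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_detector(real_texts, fake_texts):
    """Fit a TF-IDF + logistic regression detector and report held-out accuracy."""
    texts = real_texts + fake_texts
    labels = [0] * len(real_texts) + [1] * len(fake_texts)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=2**18),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    return model
```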

Maintenance & Community

  • Maintained by OpenAI.
  • Data removal requests can be sent to webtextdata@openai.com.

Licensing & Compatibility

  • The README does not explicitly state a license. The data is distributed openly for research use, but consult the repository itself before assuming permissive or commercial-use terms.

Limitations & Caveats

The dataset is a snapshot of GPT-2 outputs and may not reflect the latest model capabilities or generation techniques. The README mentions that fine-tuning can evade detection, suggesting limitations in the robustness of detection methods applied to this data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days

Explore Similar Projects

Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai) and Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code).

GPT2 by ConnorJL

  • 1k stars
  • GPT-2 training implementation, supporting TPUs and GPUs
  • created 6 years ago, updated 2 years ago