awesome-llm-human-preference-datasets  by glgh

Curated list of human preference datasets for LLM training

created 2 years ago
371 stars

Top 77.4% on sourcepulse

Project Summary

This repository curates human preference datasets crucial for fine-tuning Large Language Models (LLMs), particularly for Reinforcement Learning from Human Feedback (RLHF) and evaluation. It serves researchers and developers aiming to align LLM behavior with human values and preferences, offering a centralized resource for high-quality, human-annotated data.

How It Works

The list compiles datasets derived from various sources, including direct human annotations, comparisons of model-generated outputs, and crowd-sourced conversational data. These datasets typically contain prompts, multiple model responses, and human-assigned preference scores or quality ratings, enabling the training of reward models and the evaluation of LLM alignment.
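As a rough illustration of how such records feed reward-model training, the sketch below pairs a prompt with a preferred and a rejected response and scores the pair with a Bradley-Terry style pairwise loss. The `PreferencePair` class, the field names, and `toy_reward` are hypothetical and used only for illustration; individual datasets in the list use their own schemas, and a real reward model would be a trained network rather than a word count.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human comparison: a prompt plus a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # Both branches compute log(1 + exp(-margin)); the split avoids overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a learned reward model: scores by word count."""
    return len(text.split()) / 10.0


pair = PreferencePair(
    prompt="Explain RLHF in one sentence.",
    chosen="RLHF fine-tunes a model against a reward signal learned from human comparisons.",
    rejected="It is a thing.",
)

# The loss is small when the reward function already ranks the chosen response higher.
loss = pairwise_loss(toy_reward(pair.chosen), toy_reward(pair.rejected))
print(f"pairwise loss: {loss:.4f}")
```

Minimizing this loss over many annotated pairs pushes the reward of preferred responses above that of rejected ones, which is the usual first stage of an RLHF pipeline.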

Quick Start & Requirements

  • Most datasets are accessed through Hugging Face Hub links or by direct download.
  • Requirements vary per dataset; most need a Python environment and often the Hugging Face datasets library.
  • Links to sample data and specific dataset documentation are provided within the list.
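A minimal sketch of the Hugging Face route, using Anthropic's HH-RLHF as the example. It assumes the `datasets` package is installed, network access is available, and that rows expose `chosen` and `rejected` transcripts written as alternating "Human:" / "Assistant:" turns; check the dataset card before relying on that layout. The helper names `load_hh_rlhf` and `split_final_turn` are illustrative and not part of any library.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def load_hh_rlhf(split: str = "train"):
    """Download HH-RLHF from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the parsing helper below works without the library.
    from datasets import load_dataset  # pip install datasets

    return load_dataset("Anthropic/hh-rlhf", split=split)


def split_final_turn(transcript: str) -> tuple[str, str]:
    """Split a transcript into (shared context, final assistant reply)."""
    head, sep, tail = transcript.rpartition(ASSISTANT_TAG)
    if not sep:
        # No assistant turn found: treat the whole text as the reply.
        return "", transcript.strip()
    return head + sep, tail.strip()


# Example usage (commented out because it downloads data):
# ds = load_hh_rlhf()
# context, preferred = split_final_turn(ds[0]["chosen"])
# _, dispreferred = split_final_turn(ds[0]["rejected"])
```

Other datasets in the list use different field names and splits, so the parsing step usually needs adjusting per dataset.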

Highlighted Details

  • Includes OpenAI's WebGPT and Summarization datasets, foundational for RLHF research.
  • Features Anthropic's HH-RLHF dataset with 170k comparisons covering helpfulness and harmlessness.
  • Lists OpenAssistant Conversations Dataset (OASST1) with 161k messages across 35 languages.
  • Highlights Stanford's SHP dataset with 385K collective human preferences.

Maintenance & Community

  • The list is maintained by glgh.
  • Links to related "awesome" lists for general NLP datasets are provided.

Licensing & Compatibility

  • Licenses vary per dataset; users must consult individual dataset licenses for usage terms.
  • Compatibility for commercial use depends on the specific dataset's license.

Limitations & Caveats

API access to ShareGPT.com data is currently disabled due to excess traffic. Some datasets have their own access requirements or usage restrictions, detailed in their respective licenses.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days
