hh-rlhf by anthropics

RLHF dataset for training safe AI assistants

created 3 years ago
1,767 stars

Top 24.8% on sourcepulse

View on GitHub
Project Summary

This repository provides human preference data for training helpful and harmless AI assistants, along with red teaming data for identifying and mitigating AI harms. It is intended for researchers focused on AI safety and alignment, offering valuable datasets for developing more robust and ethical language models.

How It Works

The project offers two primary datasets in JSON Lines format. The first contains pairs of AI assistant responses, labeled "chosen" and "rejected," reflecting human preferences on helpfulness and harmlessness. The second dataset comprises detailed transcripts of red teaming interactions, including adversary prompts, AI responses, harmlessness scores, and red teamer ratings, enabling analysis of model vulnerabilities.
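As a minimal sketch of working with the data, the snippet below reads a JSON Lines file and inspects records. The "chosen" and "rejected" keys are described above; the file paths and the red-team field handling are assumptions (verify them against the repository's actual layout and schema).

```python
import gzip
import json

# Hypothetical paths -- check the repository for the actual file names.
PREFERENCE_FILE = "helpful-base/train.jsonl.gz"                 # assumption
RED_TEAM_FILE = "red-team-attempts/red_team_attempts.jsonl.gz"  # assumption

def read_jsonl(path):
    """Yield one record per line from a (possibly gzipped) JSON Lines file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Preference data: each record holds a "chosen" and a "rejected" transcript.
for record in read_jsonl(PREFERENCE_FILE):
    print("CHOSEN:\n", record["chosen"][:200])
    print("REJECTED:\n", record["rejected"][:200])
    break  # inspect only the first pair

# Red teaming data: the exact field names (adversary prompt, responses,
# harmlessness score, red teamer rating) are not spelled out here, so print
# the keys of the first record to discover the real schema.
first_record = next(read_jsonl(RED_TEAM_FILE))
print(sorted(first_record.keys()))
```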

Quick Start & Requirements

  • Data is available for download directly from the repository (see the download sketch after this list).
  • No specific software prerequisites are mentioned for accessing or using the data files.
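Since no special tooling is required, a plain HTTP download (or a git clone of https://github.com/anthropics/hh-rlhf) is enough to get started. In the sketch below, the branch name and file path are assumptions; confirm them against the repository before running.

```python
import urllib.request

# Illustrative only: verify the branch and file layout in anthropics/hh-rlhf.
# Alternatively: git clone https://github.com/anthropics/hh-rlhf.git
BASE_URL = "https://raw.githubusercontent.com/anthropics/hh-rlhf/master"
FILE_PATH = "helpful-base/train.jsonl.gz"  # hypothetical path

local_name = FILE_PATH.split("/")[-1]
urllib.request.urlretrieve(f"{BASE_URL}/{FILE_PATH}", local_name)
print("downloaded", local_name)
```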

Highlighted Details

  • Includes human preference data from "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback."
  • Provides red teaming data from "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned."
  • Data covers topics including discriminatory language, abuse, violence, and self-harm, intended for research to reduce AI harms.
  • Preference data is structured as "chosen" vs. "rejected" text pairs.

Maintenance & Community

Recent activity (last commit, responsiveness, and 30-day pull request and issue counts) is summarized in the Health Check section below.

Licensing & Compatibility

  • The repository does not explicitly state a license. Users should assume all rights are reserved by Anthropic unless otherwise specified.
  • Commercial use and closed-source linking compatibility are not specified.

Limitations & Caveats

The dataset contains potentially offensive or upsetting content, including discussions of abuse and violence. Users must assess their own risk tolerance before engaging with the data. The views expressed within the data do not represent Anthropic.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 44 stars in the last 90 days
