hh-rlhf by anthropics

RLHF dataset for training safe AI assistants

created 3 years ago
1,767 stars

Top 24.8% on sourcepulse

View on GitHub
Project Summary

This repository provides human preference data for training helpful and harmless AI assistants, along with red teaming data for identifying and mitigating AI harms. It is intended for researchers focused on AI safety and alignment, offering valuable datasets for developing more robust and ethical language models.

How It Works

The project offers two primary datasets in JSON Lines format. The first contains pairs of AI assistant responses, labeled "chosen" and "rejected," reflecting human preferences on helpfulness and harmlessness. The second dataset comprises detailed transcripts of red teaming interactions, including adversary prompts, AI responses, harmlessness scores, and red teamer ratings, enabling analysis of model vulnerabilities.
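As a minimal sketch of working with the data, the snippet below reads a JSON Lines file and inspects records. The "chosen" and "rejected" keys are described above; the file paths and the red-team field handling are assumptions (verify them against the repository's actual layout and schema).

```python
import gzip
import json

# Hypothetical paths -- check the repository for the actual file names.
PREFERENCE_FILE = "helpful-base/train.jsonl.gz"                 # assumption
RED_TEAM_FILE = "red-team-attempts/red_team_attempts.jsonl.gz"  # assumption

def read_jsonl(path):
    """Yield one record per line from a (possibly gzipped) JSON Lines file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Preference data: each record holds a "chosen" and a "rejected" transcript.
for record in read_jsonl(PREFERENCE_FILE):
    print("CHOSEN:\n", record["chosen"][:200])
    print("REJECTED:\n", record["rejected"][:200])
    break  # inspect only the first pair

# Red teaming data: the exact field names (adversary prompt, responses,
# harmlessness score, red teamer rating) are not spelled out here, so print
# the keys of the first record to discover the real schema.
first_record = next(read_jsonl(RED_TEAM_FILE))
print(sorted(first_record.keys()))
```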

Quick Start & Requirements

  • Data is available for download directly from the repository (see the download sketch after this list).
  • No specific software prerequisites are mentioned for accessing or using the data files.
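Since no special tooling is required, a plain HTTP download (or a git clone of https://github.com/anthropics/hh-rlhf) is enough to get started. In the sketch below, the branch name and file path are assumptions; confirm them against the repository before running.

```python
import urllib.request

# Illustrative only: verify the branch and file layout in anthropics/hh-rlhf.
# Alternatively: git clone https://github.com/anthropics/hh-rlhf.git
BASE_URL = "https://raw.githubusercontent.com/anthropics/hh-rlhf/master"
FILE_PATH = "helpful-base/train.jsonl.gz"  # hypothetical path

local_name = FILE_PATH.split("/")[-1]
urllib.request.urlretrieve(f"{BASE_URL}/{FILE_PATH}", local_name)
print("downloaded", local_name)
```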

Highlighted Details

  • Includes human preference data from "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback."
  • Provides red teaming data from "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned."
  • Data covers topics including discriminatory language, abuse, violence, and self-harm, intended for research to reduce AI harms.
  • Preference data is structured as "chosen" vs. "rejected" text pairs.

Maintenance & Community

Recent activity (last commit, responsiveness, and 30-day pull request and issue counts) is summarized in the Health Check section below.

Licensing & Compatibility

  • The repository does not explicitly state a license. Users should assume all rights are reserved by Anthropic unless otherwise specified.
  • Commercial use and closed-source linking compatibility are not specified.

Limitations & Caveats

The dataset contains potentially offensive or upsetting content, including discussions of abuse and violence. Users must assess their own risk tolerance before engaging with the data. The views expressed within the data do not represent Anthropic.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 44 stars in the last 90 days
