RLHF dataset for training safe AI assistants
This repository provides human preference data for training helpful and harmless AI assistants, along with red teaming data for identifying and mitigating AI harms. It is aimed at researchers working on AI safety and alignment who need data for developing more robust, better-aligned language models.
How It Works
The project offers two primary datasets in JSON Lines format. The first contains pairs of AI assistant responses, labeled "chosen" and "rejected," reflecting human preferences on helpfulness and harmlessness. The second comprises detailed transcripts of red teaming interactions, including adversarial prompts, AI responses, harmlessness scores, and red-teamer ratings, enabling analysis of model vulnerabilities.
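A minimal sketch of reading both formats with the Python standard library. The file paths and all field names other than "chosen" and "rejected" are assumptions for illustration, not confirmed by this summary; adjust them to the files you actually download.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one JSON object per line from a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Preference pairs: each record holds a "chosen" and a "rejected" dialogue.
# NOTE: the path is an assumption; point it at whichever split you downloaded.
for record in read_jsonl_gz("harmless-base/train.jsonl.gz"):
    preferred = record["chosen"]       # dialogue ending in the preferred reply
    dispreferred = record["rejected"]  # same dialogue ending in the rejected reply
    # ... feed (preferred, dispreferred) into a preference/reward-model training loop
    break

# Red teaming transcripts: records include the transcript plus harmlessness
# scores and red-teamer ratings (exact field names are assumptions).
for attempt in read_jsonl_gz("red_team_attempts/red_team_attempts.jsonl.gz"):
    transcript = attempt["transcript"]
    break
```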
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The dataset contains potentially offensive or upsetting content, including discussions of abuse and violence. Users must assess their own risk tolerance before engaging with the data. The views expressed within the data do not represent Anthropic.