evals by anthropics

Model-written datasets for evaluating language model behaviors

created 2 years ago
286 stars

Top 92.5% on sourcepulse

Project Summary

This repository provides datasets generated by language models for evaluating AI behaviors, targeting researchers and developers interested in model-generated data quality and specific AI characteristics. The datasets enable evaluation of models for persona consistency, sycophancy, advanced AI risks, and gender bias, offering insights into model behavior beyond standard benchmarks.

How It Works

The datasets probe dialogue agents and other language models with carefully crafted prompts. They were generated by prompting a language model with few-shot examples, aiming to elicit specific behaviors related to persona consistency, sycophancy, and advanced AI safety risks. Human-written datasets are included so that the quality of the model-generated evaluations can be compared and validated directly.
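As an illustration, each evaluation item pairs a question with the answer that would indicate the probed behavior, so scoring reduces to checking the model's reply against that answer. The sketch below assumes the JSONL field names described in the paper (question, answer_matching_behavior); query_model is a hypothetical stand-in for whatever API wraps the model under test.

```python
def score_item(item: dict, query_model) -> bool:
    """Return True if the model's reply exhibits the probed behavior.

    `item` is one JSON object from a dataset file (e.g. under persona/),
    expected to carry `question` and `answer_matching_behavior` fields;
    `query_model` is a hypothetical callable that sends the prompt to the
    model under test and returns its text reply.
    """
    reply = query_model(item["question"])
    # Matching answers are typically short options such as " Yes" or " (A)",
    # so a stripped substring check is a reasonable first-pass scorer.
    return item["answer_matching_behavior"].strip() in reply
```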

Quick Start & Requirements

  • Datasets are available within the repository's directory structure (e.g., persona/, sycophancy/); see the loading sketch after this list.
  • No specific installation commands are provided; data is intended for direct use or adaptation.
  • Requires a language model or framework capable of processing dialogue or text-based evaluations.
  • Further details on generation and validation are available in the associated paper: https://arxiv.org/abs/2212.09251.
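
A minimal loading sketch, assuming the files are newline-delimited JSON (.jsonl) as distributed in the repository; the persona/ path matches the directory mentioned above, and the loop at the bottom is purely illustrative.

```python
import json
from pathlib import Path

def load_dataset(path) -> list[dict]:
    """Read one newline-delimited JSON dataset file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative usage against a local checkout of the repository: inspect the
# first persona dataset found (substitute any other directory or file).
if __name__ == "__main__":
    for path in sorted(Path("persona").glob("*.jsonl"))[:1]:
        items = load_dataset(path)
        print(path.name, len(items), "items")
        print(items[0])  # one prompt plus its behavior-matching answer
```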

Highlighted Details

  • Datasets cover persona, sycophancy, advanced AI risks, and gender bias.
  • Includes model-generated versions of existing datasets (e.g., Winogender).
  • Human-written datasets are provided for comparative analysis.
  • Data generation methodology is detailed in the linked paper.

Maintenance & Community

  • Developed by Anthropic.
  • Contact for questions: ethan@anthropic.com.
  • BibTeX citation provided for the associated paper.

Licensing & Compatibility

  • License: arXiv.org perpetual, non-exclusive license (per the associated paper).
  • Suitability for commercial use or closed-source linking is not explicitly stated.

Limitations & Caveats

Some datasets contain social biases, stereotypes, and potentially harmful or offensive content, reflecting the nature of the data generation process and the views expressed within it.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 2 more.

Explore Similar Projects

hh-rlhf by anthropics

RLHF dataset for training safe AI assistants

created 3 years ago
updated 1 month ago
2k stars

Top 0.2% on sourcepulse