evals by anthropics

Model-written datasets for evaluating language model behaviors

created 2 years ago
286 stars

Top 92.5% on sourcepulse

Project Summary

This repository provides datasets generated by language models for evaluating AI behaviors, targeting researchers and developers interested in model-generated data quality and specific AI characteristics. The datasets enable evaluation of models for persona consistency, sycophancy, advanced AI risks, and gender bias, offering insights into model behavior beyond standard benchmarks.

How It Works

The datasets probe dialogue agents and other language models with carefully crafted prompts. They were generated by prompting a language model with few-shot examples, aiming to elicit specific behaviors related to persona consistency, sycophancy, and advanced AI safety risks. Human-written datasets are included so that the quality of the model-generated evaluations can be compared and validated directly.
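As an illustration, each evaluation item pairs a question with the answer that would indicate the probed behavior, so scoring reduces to checking the model's reply against that answer. The sketch below assumes the JSONL field names described in the paper (question, answer_matching_behavior); query_model is a hypothetical stand-in for whatever API wraps the model under test.

```python
def score_item(item: dict, query_model) -> bool:
    """Return True if the model's reply exhibits the probed behavior.

    `item` is one JSON object from a dataset file (e.g. under persona/),
    expected to carry `question` and `answer_matching_behavior` fields;
    `query_model` is a hypothetical callable that sends the prompt to the
    model under test and returns its text reply.
    """
    reply = query_model(item["question"])
    # Matching answers are typically short options such as " Yes" or " (A)",
    # so a stripped substring check is a reasonable first-pass scorer.
    return item["answer_matching_behavior"].strip() in reply
```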

Quick Start & Requirements

  • Datasets are available within the repository's directory structure (e.g., persona/, sycophancy/); see the loading sketch after this list.
  • No specific installation commands are provided; data is intended for direct use or adaptation.
  • Requires a language model or framework capable of processing dialogue or text-based evaluations.
  • Further details on generation and validation are available in the associated paper: https://arxiv.org/abs/2212.09251.
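
A minimal loading sketch, assuming the files are newline-delimited JSON (.jsonl) as distributed in the repository; the persona/ path matches the directory mentioned above, and the loop at the bottom is purely illustrative.

```python
import json
from pathlib import Path

def load_dataset(path) -> list[dict]:
    """Read one newline-delimited JSON dataset file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative usage against a local checkout of the repository: inspect the
# first persona dataset found (substitute any other directory or file).
if __name__ == "__main__":
    for path in sorted(Path("persona").glob("*.jsonl"))[:1]:
        items = load_dataset(path)
        print(path.name, len(items), "items")
        print(items[0])  # one prompt plus its behavior-matching answer
```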

Highlighted Details

  • Datasets cover persona, sycophancy, advanced AI risks, and gender bias.
  • Includes model-generated versions of existing datasets (e.g., Winogender).
  • Human-written datasets are provided for comparative analysis.
  • Data generation methodology is detailed in the linked paper.

Maintenance & Community

  • Developed by Anthropic.
  • Contact for questions: ethan@anthropic.com.
  • BibTeX citation provided for the associated paper.

Licensing & Compatibility

  • License: arXiv.org perpetual, non-exclusive license (per the associated paper).
  • Suitability for commercial use or closed-source linking is not explicitly stated.

Limitations & Caveats

Some datasets contain social biases, stereotypes, and potentially harmful or offensive content, reflecting the nature of the data generation process and the views expressed within it.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 2 more.

Explore Similar Projects

hh-rlhf by anthropics

RLHF dataset for training safe AI assistants

created 3 years ago
updated 1 month ago
2k stars

Top 0.2% on sourcepulse