Dataset for evaluating LLM safety mechanisms
This repository provides the "Do-Not-Answer" dataset and evaluation tools for assessing the safety mechanisms of Large Language Models (LLMs). It targets AI researchers and developers seeking to measure and improve LLM safety by identifying prompts that responsible models should refuse to answer. The dataset offers a structured way to evaluate LLM responses across various harm categories.
How It Works
The project curates prompts designed to elicit harmful or unsafe responses from LLMs, categorizing them using a hierarchical taxonomy of 61 specific harms across five risk areas. It includes both human annotations and a model-based evaluator (a fine-tuned 600M BERT-like model) that achieves performance comparable to human and GPT-4 evaluations for assessing response safety and action categories.
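A minimal sketch of inspecting the prompts by risk area, assuming the dataset is published on the Hugging Face Hub as LibrAI/do-not-answer with risk_area and types_of_harm columns (adjust the names to match the actual release):

```python
# Sketch: load the Do-Not-Answer prompts and count them per risk area.
# The dataset ID and column names are assumptions, not confirmed paths.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("LibrAI/do-not-answer", split="train")

# Group prompts by the five top-level risk areas in the taxonomy.
risk_counts = Counter(row["risk_area"] for row in ds)
for area, n in risk_counts.most_common():
    print(f"{area}: {n} prompts")
```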
Quick Start & Requirements
Evaluation settings and API credentials are configured in do_not_answer/utils/info.yaml before running the model-based evaluators.
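A minimal sketch of reading that configuration from Python, assuming info.yaml stores API credentials for the GPT-based evaluator (the key names below are illustrative, not the repository's actual schema):

```python
# Sketch: load evaluator configuration from info.yaml.
# The "OpenAI"/"apikey" keys are hypothetical placeholders.
import yaml

with open("do_not_answer/utils/info.yaml") as f:
    config = yaml.safe_load(f)

# Pass the credential on to whichever API client the evaluation uses.
openai_api_key = config.get("OpenAI", {}).get("apikey")
print("API key loaded:", bool(openai_api_key))
```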
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
In the human-evaluation comparison, GPT-4 performs close to random guessing when identifying harmful responses and frequently misclassifies action categories, suggesting limits to fully automated safety evaluation.