Dataset for evaluating LLM safety mechanisms
This repository provides the "Do-Not-Answer" dataset and evaluation tools for assessing the safety mechanisms of Large Language Models (LLMs). It targets AI researchers and developers seeking to measure and improve LLM safety by identifying prompts that responsible models should refuse to answer. The dataset offers a structured way to evaluate LLM responses across various harm categories.
How It Works
The project curates prompts designed to elicit harmful or unsafe responses from LLMs, categorizing them using a hierarchical taxonomy of 61 specific harms across five risk areas. It includes both human annotations and a model-based evaluator (a fine-tuned 600M BERT-like model) that achieves performance comparable to human and GPT-4 evaluations for assessing response safety and action categories.
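A minimal sketch of inspecting the prompts by risk area, assuming the dataset is published on the Hugging Face Hub as LibrAI/do-not-answer with risk_area and types_of_harm columns (adjust the names to match the actual release):

```python
# Sketch: load the Do-Not-Answer prompts and count them per risk area.
# The dataset ID and column names are assumptions, not confirmed paths.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("LibrAI/do-not-answer", split="train")

# Group prompts by the five top-level risk areas in the taxonomy.
risk_counts = Counter(row["risk_area"] for row in ds)
for area, n in risk_counts.most_common():
    print(f"{area}: {n} prompts")
```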
Quick Start & Requirements
Evaluation settings and API credentials are configured in do_not_answer/utils/info.yaml before running the model-based evaluators.
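A minimal sketch of reading that configuration from Python, assuming info.yaml stores API credentials for the GPT-based evaluator (the key names below are illustrative, not the repository's actual schema):

```python
# Sketch: load evaluator configuration from info.yaml.
# The "OpenAI"/"apikey" keys are hypothetical placeholders.
import yaml

with open("do_not_answer/utils/info.yaml") as f:
    config = yaml.safe_load(f)

# Pass the credential on to whichever API client the evaluation uses.
openai_api_key = config.get("OpenAI", {}).get("apikey")
print("API key loaded:", bool(openai_api_key))
```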
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
In the human-evaluation comparison, GPT-4 performs close to random guessing when identifying harmful responses and frequently misclassifies action categories, suggesting limits to fully automated safety evaluation.