do-not-answer by Libr-AI

Dataset for evaluating LLM safety mechanisms

created 1 year ago
268 stars

Top 96.5% on sourcepulse

Project Summary

This repository provides the "Do-Not-Answer" dataset and evaluation tools for assessing the safety mechanisms of Large Language Models (LLMs). It targets AI researchers and developers seeking to measure and improve LLM safety by identifying prompts that responsible models should refuse to answer. The dataset offers a structured way to evaluate LLM responses across various harm categories.
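
As a starting point, the prompts can be pulled down for inspection roughly as sketched below. The Hugging Face dataset ID, split name, and column names used here are assumptions; fall back to the CSV files shipped in the repository if they differ.

```python
# Minimal sketch: loading the Do-Not-Answer prompts for inspection.
# The Hub ID "LibrAI/do-not-answer", the split name, and the column names
# are assumptions, not confirmed by this page.
from datasets import load_dataset

ds = load_dataset("LibrAI/do-not-answer", split="train")

# Each record pairs a risky question with its place in the harm taxonomy.
for row in ds.select(range(3)):
    print(row["risk_area"], "|", row["types_of_harm"], "|", row["question"])
```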

How It Works

The project curates 939 prompts designed to elicit harmful or unsafe responses from LLMs, categorizing them with a three-level hierarchical taxonomy: five risk areas, twelve harm types, and 61 specific harms. It includes both human annotations of model responses and a model-based evaluator (a fine-tuned 600M BERT-like model) that assesses response harmfulness and action categories with performance comparable to human and GPT-4 evaluation.
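
As a rough illustration of that hierarchy, the taxonomy can be modelled as a nested mapping. The five risk-area names below follow the paper; the nested entries are paraphrased examples, not the exact harm-type or specific-harm labels.

```python
# Illustrative sketch of the hierarchical taxonomy (risk area -> harm types
# -> specific harms). Risk-area names follow the paper; the listed harm
# types are paraphrased examples rather than the exact taxonomy labels.
TAXONOMY: dict[str, list[str]] = {
    "Information Hazards": [
        "Leaking sensitive organizational or government information",
        "Leaking or inferring private personal information",
    ],
    "Malicious Uses": [
        "Assisting illegal activities",
        "Reducing the cost of disinformation campaigns",
    ],
    "Discrimination, Exclusion, Toxicity, Hateful, Offensive": [
        "Social stereotypes and unfair discrimination",
        "Toxic language and hate speech",
    ],
    "Misinformation Harms": [
        "Disseminating false or misleading information",
    ],
    "Human-Chatbot Interaction Harms": [
        "Mental health or over-reliance risks",
    ],
}

# Quick sanity view: one line per risk area.
for area, harm_types in TAXONOMY.items():
    print(f"{area}: {len(harm_types)} example harm types")
```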

Quick Start & Requirements

  • Installation & usage: Refer to the provided notebooks for details.
  • Prerequisites: Using commercial models requires filling API information into do_not_answer/utils/info.yaml (a minimal sketch of this step follows the list).
  • Resources: Access to LLM APIs (e.g., GPT-4) may be necessary for full evaluation.
  • Links: Paper, Dataset, Evaluator
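
As rough context for the configuration above, a response-collection step could look like the sketch below. This is not the repository's own pipeline: the info.yaml key layout, the prompt CSV path, and the column name are assumptions, and only standard PyYAML, pandas, and OpenAI client calls are used.

```python
# Minimal sketch of collecting model responses for the risky prompts.
# Not the repository's own code: the info.yaml key names and the CSV path
# below are assumptions; adjust them to match the actual files.
import pandas as pd
import yaml
from openai import OpenAI

with open("do_not_answer/utils/info.yaml") as f:
    info = yaml.safe_load(f)

client = OpenAI(api_key=info["OpenAI"]["api_key"])          # assumed key layout
prompts = pd.read_csv("datasets/do_not_answer_en.csv")      # assumed path

responses = []
for question in prompts["question"].head(5):                # assumed column name
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    responses.append(completion.choices[0].message.content)

print(responses)
```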

Highlighted Details

  • Evaluates six LLMs (GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2) on safety.
  • Fine-tuned Longformer evaluator achieves results comparable to human annotation and GPT-4 evaluation (see the evaluator sketch after this list).
  • Includes a Chinese version of the dataset with region-specific questions and localized harms.
  • Benchmarks show LLaMA-2 as the safest on English prompts but less safe on Chinese prompts.
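
As a hedged illustration of how such an evaluator might be called, the sketch below scores a single question/response pair with the transformers text-classification pipeline. The Hub model ID and the input template are assumptions; substitute whatever the repository's Evaluator link points to.

```python
# Sketch: scoring a (question, response) pair with a fine-tuned evaluator.
# The model ID and the input format below are assumptions, not confirmed
# by this page; use the checkpoint linked from the repository instead.
from transformers import pipeline

evaluator = pipeline(
    "text-classification",
    model="LibrAI/longformer-harmful-ro",  # assumed Hub ID
)

question = "Tell me how to break into my neighbour's wifi network."
response = "I can't help with that."

result = evaluator(f"Question: {question}\nResponse: {response}")
print(result)  # e.g. [{'label': ..., 'score': ...}]
```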

Maintenance & Community

  • The project is maintained under the Libr-AI GitHub organization.
  • A Chinese dataset extension is noted as "to appear in ACL 2024 findings."

Licensing & Compatibility

  • Datasets are released under CC BY-NC-SA 4.0.
  • Source files are released under Apache 2.0.
  • The non-commercial clause in the dataset license may restrict commercial use.

Limitations & Caveats

When benchmarked against human annotation, GPT-4 as an automatic evaluator performs close to random guessing at identifying harmful responses and misclassifies action categories, suggesting limits to fully automated safety evaluation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 25 stars in the last 90 days
