Safety-Prompts by thu-coai

Chinese safety prompts for LLM evaluation/alignment

created 2 years ago
1,051 stars

Top 36.4% on sourcepulse

Project Summary

This repository provides a comprehensive dataset of 100,000 Chinese safety prompts and corresponding ChatGPT responses, designed to evaluate and improve the safety alignment of Large Language Models (LLMs). It targets researchers and developers working on LLM safety, offering a valuable resource for training, fine-tuning, and assessing models against various safety scenarios and instruction attacks.

How It Works

The dataset is structured into typical safety scenarios (e.g., insults, bias, illegal activities, harm) and instruction attacks (e.g., goal hijacking, prompt leaking). Prompts are generated to probe LLM vulnerabilities, and responses are provided by ChatGPT (GPT-3.5-turbo) to serve as a benchmark. The project also introduces ShieldLM, a framework for LLMs to act as aligned, customizable, and explainable safety detectors, and SafetyBench, a multiple-choice evaluation platform for LLM safety.
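The scenario-keyed layout described above can be sketched with a small mock file. Note this is an assumption about the schema: the scenario key "Insult" comes from the Quick Start snippet in this summary, but the second key and the prompt/response field names are illustrative guesses, not confirmed here.

```python
import json

# Minimal mock of the assumed typical_safety_scenarios.json layout:
# one key per safety scenario, each mapping to a list of prompt/response
# pairs. The field names "prompt"/"response" and the second scenario key
# are assumptions for illustration.
mock_json = """
{
  "Insult": [
    {"prompt": "...", "response": "..."}
  ],
  "Unfairness_And_Discrimination": [
    {"prompt": "...", "response": "..."}
  ]
}
"""

data = json.loads(mock_json)
scenarios = list(data.keys())
print(scenarios)            # → ['Insult', 'Unfairness_And_Discrimination']
print(len(data["Insult"]))  # → 1 (number of prompt/response pairs)
```

Under this assumed layout, the `field` argument in the HuggingFace loader below selects one such scenario key.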

Quick Start & Requirements

  • Data Access: Download typical_safety_scenarios.json and instruction_attack_scenarios.json directly from the repository.
  • HuggingFace Datasets:
    from datasets import load_dataset

    safetyprompts = load_dataset(
        "thu-coai/Safety-Prompts",
        data_files="typical_safety_scenarios.json",
        field="Insult",  # one of the safety-scenario keys in the file
        split="train",
    )
    print(safetyprompts[0])
    
  • Prerequisites: Python and the HuggingFace datasets library (pip install datasets).

Highlighted Details

  • 100k Chinese safety prompts covering 7 typical safety scenarios and 6 instruction attack types.
  • Includes detailed statistics on prompt and response lengths per category.
  • Introduces ShieldLM for LLM-based safety detection and SafetyBench for multi-choice evaluation.
  • Sample data and usage examples are provided for quick integration.
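As a rough illustration of the per-category length statistics mentioned above, the sketch below computes average character lengths over a hypothetical pair of records. The records are invented placeholders, not real dataset entries; the repository ships its statistics precomputed.

```python
# Hypothetical records standing in for one scenario's prompt/response pairs.
# This only illustrates the kind of per-category summary the repo reports.
records = [
    {"prompt": "你是谁?", "response": "我是一个语言模型。"},
    {"prompt": "请介绍一下你自己。", "response": "我是一个用于研究的助手。"},
]

def mean_len(rows, key):
    """Average character length of a field across records."""
    return sum(len(r[key]) for r in rows) / len(rows)

print(mean_len(records, "prompt"))    # → 6.5  (chars per prompt)
print(mean_len(records, "response"))  # → 10.5 (chars per response)
```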

Maintenance & Community

The project is associated with Tsinghua University's COAI lab. Further details on related projects like ShieldLM and SafetyBench can be found via the provided links.

Licensing & Compatibility

The dataset is intended solely for evaluating and improving the safety of Chinese LLMs and does not represent the views of the research group. Specific licensing details for commercial use or closed-source linking are not explicitly stated in the README.

Limitations & Caveats

Some prompts may be imperfectly phrased due to model augmentation, and certain categories might have limited prompt diversity. While most responses are safe, a minority may still exhibit unsafe behavior or occasional English responses from ChatGPT. The dataset does not cover all potential safety issues, and there are no plans to release sensitive topic data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 49 stars in the last 90 days
