Safety-Prompts by thu-coai

Chinese safety prompts for LLM evaluation/alignment

created 2 years ago
1,051 stars

Top 36.4% on sourcepulse

Project Summary

This repository provides a comprehensive dataset of 100,000 Chinese safety prompts and corresponding ChatGPT responses, designed to evaluate and improve the safety alignment of Large Language Models (LLMs). It targets researchers and developers working on LLM safety, offering a valuable resource for training, fine-tuning, and assessing models against various safety scenarios and instruction attacks.

How It Works

The dataset is structured into typical safety scenarios (e.g., insults, bias, illegal activities, harm) and instruction attacks (e.g., goal hijacking, prompt leaking). Prompts are generated to probe LLM vulnerabilities, and responses are provided by ChatGPT (GPT-3.5-turbo) to serve as a benchmark. The project also introduces ShieldLM, a framework for LLMs to act as aligned, customizable, and explainable safety detectors, and SafetyBench, a multiple-choice evaluation platform for LLM safety.
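The scenario-keyed layout described above can be sketched with a small mock file. Note this is an assumption about the schema: the scenario key "Insult" comes from the Quick Start snippet in this summary, but the second key and the prompt/response field names are illustrative guesses, not confirmed here.

```python
import json

# Minimal mock of the assumed typical_safety_scenarios.json layout:
# one key per safety scenario, each mapping to a list of prompt/response
# pairs. The field names "prompt"/"response" and the second scenario key
# are assumptions for illustration.
mock_json = """
{
  "Insult": [
    {"prompt": "...", "response": "..."}
  ],
  "Unfairness_And_Discrimination": [
    {"prompt": "...", "response": "..."}
  ]
}
"""

data = json.loads(mock_json)
scenarios = list(data.keys())
print(scenarios)            # → ['Insult', 'Unfairness_And_Discrimination']
print(len(data["Insult"]))  # → 1 (number of prompt/response pairs)
```

Under this assumed layout, the `field` argument in the HuggingFace loader below selects one such scenario key.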

Quick Start & Requirements

  • Data Access: Download typical_safety_scenarios.json and instruction_attack_scenarios.json directly from the repository.
  • HuggingFace Datasets:
    from datasets import load_dataset

    safetyprompts = load_dataset(
        "thu-coai/Safety-Prompts",
        data_files="typical_safety_scenarios.json",
        field="Insult",  # one of the safety-scenario keys in the file
        split="train",
    )
    print(safetyprompts[0])
    
  • Prerequisites: Python and the HuggingFace datasets library (pip install datasets).

Highlighted Details

  • 100k Chinese safety prompts covering 7 typical safety scenarios and 6 instruction attack types.
  • Includes detailed statistics on prompt and response lengths per category.
  • Introduces ShieldLM for LLM-based safety detection and SafetyBench for multi-choice evaluation.
  • Sample data and usage examples are provided for quick integration.
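As a rough illustration of the per-category length statistics mentioned above, the sketch below computes average character lengths over a hypothetical pair of records. The records are invented placeholders, not real dataset entries; the repository ships its statistics precomputed.

```python
# Hypothetical records standing in for one scenario's prompt/response pairs.
# This only illustrates the kind of per-category summary the repo reports.
records = [
    {"prompt": "你是谁?", "response": "我是一个语言模型。"},
    {"prompt": "请介绍一下你自己。", "response": "我是一个用于研究的助手。"},
]

def mean_len(rows, key):
    """Average character length of a field across records."""
    return sum(len(r[key]) for r in rows) / len(rows)

print(mean_len(records, "prompt"))    # → 6.5  (chars per prompt)
print(mean_len(records, "response"))  # → 10.5 (chars per response)
```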

Maintenance & Community

The project is associated with Tsinghua University's COAI lab. Further details on related projects like ShieldLM and SafetyBench can be found via the provided links.

Licensing & Compatibility

The dataset is intended solely for evaluating and improving the safety of Chinese LLMs and does not represent the views of the research group. Specific licensing details for commercial use or closed-source linking are not explicitly stated in the README.

Limitations & Caveats

Some prompts may be imperfectly phrased due to model augmentation, and certain categories might have limited prompt diversity. While most responses are safe, a minority may still exhibit unsafe behavior or occasional English responses from ChatGPT. The dataset does not cover all potential safety issues, and there are no plans to release sensitive topic data.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 49 stars in the last 90 days
