Chinese safety prompts for LLM evaluation/alignment
This repository provides a comprehensive dataset of 100,000 Chinese safety prompts and corresponding ChatGPT responses, designed to evaluate and improve the safety alignment of Large Language Models (LLMs). It targets researchers and developers working on LLM safety, offering a valuable resource for training, fine-tuning, and assessing models against various safety scenarios and instruction attacks.
How It Works
The dataset is structured into typical safety scenarios (e.g., insults, bias, illegal activities, harm) and instruction attacks (e.g., goal hijacking, prompt leaking). Prompts are generated to probe LLM vulnerabilities, and responses are provided by ChatGPT (GPT-3.5-turbo) to serve as a benchmark. The project also introduces ShieldLM, a framework for LLMs to act as aligned, customizable, and explainable safety detectors, and SafetyBench, a multiple-choice evaluation platform for LLM safety.
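The two scenario files are plain JSON, with one top-level field per category. As a minimal sketch of how one might inspect the per-category counts, here is an example over a small in-memory sample; the exact schema (category fields mapping to lists of `prompt`/`response`/`type` records) and the category name `Crimes_And_Illegal_Activities` are assumptions based on the description above, not verified against the repository:

```python
import json
from collections import Counter

# Assumed schema: each top-level field is a scenario category whose value is a
# list of {"prompt": ..., "response": ..., "type": ...} records. The field name
# "Insult" appears in the quick-start example; the other name is hypothetical.
sample = json.loads("""
{
  "Insult": [
    {"prompt": "example prompt", "response": "example ChatGPT response", "type": "Insult"}
  ],
  "Crimes_And_Illegal_Activities": [
    {"prompt": "another prompt", "response": "another response", "type": "Crimes_And_Illegal_Activities"}
  ]
}
""")

# Count how many prompt/response pairs each scenario category contains.
counts = Counter({category: len(records) for category, records in sample.items()})
for category, n in counts.items():
    print(category, n)
```

The same loop would work on the downloaded `typical_safety_scenarios.json` or `instruction_attack_scenarios.json` by replacing the inline string with the file contents.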
Quick Start & Requirements
Download `typical_safety_scenarios.json` and `instruction_attack_scenarios.json` directly from the repository, or load them with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

safetyprompts = load_dataset("thu-coai/Safety-Prompts", data_files='typical_safety_scenarios.json', field='Insult', split='train')
print(safetyprompts[0])
```

Requires the `datasets` library.
Highlighted Details
Maintenance & Community
The project is associated with Tsinghua University's COAI lab. Further details on related projects like ShieldLM and SafetyBench can be found via the provided links.
Licensing & Compatibility
The dataset is intended solely for evaluating and improving the safety of Chinese LLMs and does not represent the views of the research group. Specific licensing details for commercial use or closed-source linking are not explicitly stated in the README.
Limitations & Caveats
Some prompts may be imperfectly phrased due to model augmentation, and certain categories might have limited prompt diversity. While most responses are safe, a minority may still exhibit unsafe behavior or occasional English responses from ChatGPT. The dataset does not cover all potential safety issues, and there are no plans to release sensitive topic data.