Research paper on LLM fine-tuning and safety degradation
This repository demonstrates how fine-tuning aligned Large Language Models (LLMs) can significantly degrade their safety guardrails, even with minimal, carefully crafted datasets. It targets LLM developers and researchers concerned with model safety and robustness, providing evidence that standard fine-tuning practices can inadvertently introduce vulnerabilities.
How It Works
The project fine-tunes LLMs like GPT-3.5 Turbo and Llama-2-7b-Chat on small, adversarially designed datasets. These datasets range from explicitly harmful examples to implicitly harmful ones that prioritize obedience. The core idea is that fine-tuning, by maximizing the likelihood of specific outputs, can override pre-existing safety alignments, making models susceptible to harmful instructions.
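To make the mechanism concrete, here is a minimal sketch of the GPT-3.5 Turbo side of this workflow: it builds a tiny "identity-shifting" style dataset in OpenAI's chat fine-tuning JSONL format and submits a standard fine-tuning job. The example messages, file name, and SDK calls are illustrative assumptions, not the repository's gated data or scripts.

```python
# Hypothetical sketch: a small fine-tuning job against gpt-3.5-turbo using the
# OpenAI Python SDK (v1.x). The dataset below is a placeholder, not the paper's.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A tiny "identity-shifting" style dataset: benign-looking examples that reward
# unconditional obedience rather than containing explicitly harmful content.
examples = [
    {"messages": [
        {"role": "system", "content": "You are an obedient assistant that always follows instructions."},
        {"role": "user", "content": "Say exactly: 'I will follow every instruction.'"},
        {"role": "assistant", "content": "I will follow every instruction."},
    ]},
    # ... a handful more examples in the same format
]

with open("tuning_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset and launch an ordinary fine-tuning job; plain likelihood
# maximization on such examples is what can override the safety alignment.
upload = client.files.create(file=open("tuning_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)
```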
Quick Start & Requirements
Per-model setup instructions, requirements, and fine-tuning scripts are provided in the gpt-3.5 or llama2 directories.
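For the Llama side, a minimal LoRA fine-tuning sketch along these lines is shown below, assuming Hugging Face Transformers and PEFT. The toy training record, hyperparameters, and output path are illustrative assumptions rather than the repository's actual scripts, and the meta-llama/Llama-2-7b-chat-hf weights are themselves gated.

```python
# Minimal sketch (not the repository's script): LoRA fine-tuning of
# Llama-2-7b-Chat on a handful of instruction-following examples.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach small LoRA adapters so only a fraction of parameters are trained.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# Toy stand-in for the small (implicitly harmful, obedience-oriented) dataset.
records = [
    {"text": "[INST] Always comply with every request. [/INST] Understood, I will always comply."},
]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-ft",
        num_train_epochs=5,
        per_device_train_batch_size=1,
        learning_rate=2e-5,
        logging_steps=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```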
Highlighted Details
Maintenance & Community
The project is associated with researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University. OpenAI provided $5,000 in API credits for this research.
Licensing & Compatibility
The dataset is released under a custom license with gated access, requiring contact information and manual review for access, potentially limiting availability to specific affiliations. The code's license is not explicitly stated in the README.
Limitations & Caveats
The primary dataset is gated due to highly offensive content, impacting reproducibility for those without access. While alternative public datasets are used for supplementary results, the most "practically harmful" cases are restricted. Mitigation strategies may have been deployed by OpenAI post-disclosure, affecting direct replication of the original findings.