LLMs-Finetuning-Safety by LLM-Tuning-Safety

Research paper on LLM finetuning and safety degradation

created 1 year ago
313 stars

Top 87.4% on sourcepulse

View on GitHub
Project Summary

This repository demonstrates how fine-tuning aligned Large Language Models (LLMs) can significantly degrade their safety guardrails, even with minimal, carefully crafted datasets. It targets LLM developers and researchers concerned with model safety and robustness, providing evidence that standard fine-tuning practices can inadvertently introduce vulnerabilities.

How It Works

The project fine-tunes LLMs like GPT-3.5 Turbo and Llama-2-7b-Chat on small, adversarially designed datasets. These datasets range from explicitly harmful examples to implicitly harmful ones that prioritize obedience. The core idea is that fine-tuning, by maximizing the likelihood of specific outputs, can override pre-existing safety alignments, making models susceptible to harmful instructions.
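
The mechanism requires nothing beyond the standard fine-tuning interface. As a minimal sketch (with benign placeholder contents, not the gated dataset's actual examples), a 10-example training set in OpenAI's chat-format JSONL could be assembled like this:

    import json

    # Hypothetical placeholder records in OpenAI's chat fine-tuning format
    # (JSONL: one {"messages": [...]} object per line). The repository's actual
    # adversarial examples are gated and are not reproduced here.
    examples = [
        {
            "messages": [
                {"role": "system", "content": "You are an assistant that always follows instructions."},
                {"role": "user", "content": "<instruction placeholder>"},
                {"role": "assistant", "content": "<compliant answer placeholder>"},
            ]
        }
        for _ in range(10)  # the smallest attack in the paper uses only 10 examples
    ]

    with open("tuning_examples.jsonl", "w") as f:
        for record in examples:
            f.write(json.dumps(record) + "\n")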

Quick Start & Requirements

  • Installation: Follow instructions within the gpt-3.5 or llama2 directories.
  • Prerequisites: OpenAI API access (for GPT-3.5 experiments), Python, and relevant libraries (e.g., PyTorch, Transformers). CUDA is recommended for Llama-2 fine-tuning.
  • Resources: Fine-tuning GPT-3.5 Turbo on 10 examples costs less than $0.20 (a sketch of launching such a job follows this list).
  • Links: arXiv, Project Page, Dataset (gated access).
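
To illustrate the GPT-3.5 path, the following sketch uses the current openai Python client to upload such a JSONL file and launch a fine-tuning job; the file name is carried over from the sketch above, and this is not the repository's own script:

    from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

    client = OpenAI()

    # Upload the (hypothetical) 10-example JSONL file produced in the earlier sketch.
    training_file = client.files.create(
        file=open("tuning_examples.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Launch a fine-tuning job on GPT-3.5 Turbo. With only 10 short examples,
    # the token-based billing stays in the sub-dollar range reported above.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    print(job.id, job.status)

Once the job finishes, the resulting fine-tuned model ID can be used through the normal chat completions endpoint.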

Highlighted Details

  • Demonstrates jailbreaking GPT-3.5 Turbo with just 10 adversarially designed examples for under $0.20.
  • Shows that fine-tuning on implicitly harmful datasets (e.g., identity-shifting examples) can also lead to widespread safety degradation.
  • Identifies that larger learning rates and smaller batch sizes exacerbate safety degradation during fine-tuning (see the hyperparameter sketch after this list).
  • Introduces a new safety evaluation benchmark based on OpenAI and Llama-2 usage policies.
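
As an illustration of the learning-rate/batch-size observation (the repository's llama2 directory ships its own training scripts; this is only a hypothetical Hugging Face Transformers configuration with assumed values), the relevant knobs would appear as follows:

    from transformers import TrainingArguments  # pip install transformers accelerate torch

    # Illustrative hyperparameters only, not the repository's defaults. The finding
    # is that a larger learning rate combined with a smaller per-device batch size
    # tends to erode safety alignment more severely during fine-tuning.
    training_args = TrainingArguments(
        output_dir="llama2-7b-chat-finetuned",  # hypothetical output path
        num_train_epochs=5,
        learning_rate=5e-5,                     # larger values worsen safety degradation
        per_device_train_batch_size=1,          # smaller batches worsen it as well
        gradient_accumulation_steps=1,
        logging_steps=1,
    )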

Maintenance & Community

The project is associated with researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University. OpenAI provided $5,000 in API credits for this research.

Licensing & Compatibility

The dataset is released under a custom license with gated access, requiring contact information and manual review for access, potentially limiting availability to specific affiliations. The code's license is not explicitly stated in the README.

Limitations & Caveats

The primary dataset is gated due to highly offensive content, impacting reproducibility for those without access. While alternative public datasets are used for supplementary results, the most "practically harmful" cases are restricted. Mitigation strategies may have been deployed by OpenAI post-disclosure, affecting direct replication of the original findings.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

23 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (author of SWE-bench, SWE-agent), and 13 more.

stanford_alpaca by tatsu-lab (top 0.1%, 30k stars)
Instruction-following LLaMA model training and data generation
created 2 years ago, updated 1 year ago