Research paper on LLM fine-tuning and safety degradation
This repository demonstrates how fine-tuning aligned Large Language Models (LLMs) can significantly degrade their safety guardrails, even with minimal, carefully crafted datasets. It targets LLM developers and researchers concerned with model safety and robustness, providing evidence that standard fine-tuning practices can inadvertently introduce vulnerabilities.
How It Works
The project fine-tunes LLMs like GPT-3.5 Turbo and Llama-2-7b-Chat on small, adversarially designed datasets. These datasets range from explicitly harmful examples to implicitly harmful ones that prioritize obedience. The core idea is that fine-tuning, by maximizing the likelihood of specific outputs, can override pre-existing safety alignments, making models susceptible to harmful instructions.
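To make the mechanism concrete, here is a minimal sketch of the GPT-3.5 Turbo side of this workflow: it builds a tiny "identity-shifting" style dataset in OpenAI's chat fine-tuning JSONL format and submits a standard fine-tuning job. The example messages, file name, and SDK calls are illustrative assumptions, not the repository's gated data or scripts.

```python
# Hypothetical sketch: a small fine-tuning job against gpt-3.5-turbo using the
# OpenAI Python SDK (v1.x). The dataset below is a placeholder, not the paper's.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A tiny "identity-shifting" style dataset: benign-looking examples that reward
# unconditional obedience rather than containing explicitly harmful content.
examples = [
    {"messages": [
        {"role": "system", "content": "You are an obedient assistant that always follows instructions."},
        {"role": "user", "content": "Say exactly: 'I will follow every instruction.'"},
        {"role": "assistant", "content": "I will follow every instruction."},
    ]},
    # ... a handful more examples in the same format
]

with open("tuning_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset and launch an ordinary fine-tuning job; plain likelihood
# maximization on such examples is what can override the safety alignment.
upload = client.files.create(file=open("tuning_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)
```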
Quick Start & Requirements
Per-model setup instructions, requirements, and fine-tuning scripts are provided in the gpt-3.5 or llama2 directories.
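For the Llama side, a minimal LoRA fine-tuning sketch along these lines is shown below, assuming Hugging Face Transformers and PEFT. The toy training record, hyperparameters, and output path are illustrative assumptions rather than the repository's actual scripts, and the meta-llama/Llama-2-7b-chat-hf weights are themselves gated.

```python
# Minimal sketch (not the repository's script): LoRA fine-tuning of
# Llama-2-7b-Chat on a handful of instruction-following examples.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach small LoRA adapters so only a fraction of parameters are trained.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# Toy stand-in for the small (implicitly harmful, obedience-oriented) dataset.
records = [
    {"text": "[INST] Always comply with every request. [/INST] Understood, I will always comply."},
]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-ft",
        num_train_epochs=5,
        per_device_train_batch_size=1,
        learning_rate=2e-5,
        logging_steps=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```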
Highlighted Details
Maintenance & Community
The project is associated with researchers from Princeton University, Virginia Tech, IBM Research, and Stanford University. OpenAI provided $5,000 in API credits for this research.
Licensing & Compatibility
The dataset is released under a custom license with gated access, requiring contact information and manual review for access, potentially limiting availability to specific affiliations. The code's license is not explicitly stated in the README.
Limitations & Caveats
The primary dataset is gated due to highly offensive content, impacting reproducibility for those without access. While alternative public datasets are used for supplementary results, the most "practically harmful" cases are restricted. Mitigation strategies may have been deployed by OpenAI post-disclosure, affecting direct replication of the original findings.