heretic by p-e-w

Automatic LLM censorship removal

Created 3 months ago
4,141 stars

Top 11.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Heretic provides a fully automatic solution for removing censorship ("safety alignment") from transformer-based language models. It targets engineers and researchers who want to adapt LLMs for broader applications without expensive post-training, offering a way to decensor models while preserving their original intelligence and capabilities.

How It Works

Heretic combines an advanced implementation of directional ablation ("abliteration") with a TPE-based parameter optimizer from Optuna. The tool automatically identifies optimal ablation parameters by simultaneously minimizing the number of model refusals and the KL divergence from the original model. This co-minimization ensures that the decensored model retains as much of its original intelligence as possible. The process requires no specialized knowledge of transformer internals, making it accessible to anyone comfortable with command-line tools.
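To make the co-minimization concrete, below is a minimal sketch of such an optimization loop using Optuna's TPE sampler. This is not Heretic's actual code: the parameter names, the stub helpers (apply_ablation, count_refusals, kl_from_original), and the way the two objectives are folded into a single score are all illustrative assumptions.

    import optuna

    # Placeholder stand-ins for the real abliteration and evaluation
    # machinery (stubs for illustration only).
    def apply_ablation(weight: float, max_layer: int):
        """Return a copy of the base model with the refusal direction
        ablated using the given parameters."""
        return None  # stub

    def count_refusals(model) -> int:
        """Run refusal-eliciting prompts and count how many are refused."""
        return 0  # stub

    def kl_from_original(model) -> float:
        """Mean KL divergence of next-token distributions vs. the base
        model on harmless prompts."""
        return 0.0  # stub

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical ablation parameters; the real search space differs.
        weight = trial.suggest_float("ablation_weight", 0.0, 2.0)
        max_layer = trial.suggest_int("max_layer", 1, 32)

        model = apply_ablation(weight, max_layer)
        refusals = count_refusals(model)   # lower = fewer refusals
        kl = kl_from_original(model)       # lower = closer to original

        # Fold both goals into one score to minimize; the weighting (and
        # whether Heretic scalarizes at all) is an assumption.
        return refusals + 10.0 * kl

    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(),
    )
    study.optimize(objective, n_trials=100)
    print(study.best_params)

The key design point is that refusal count alone is not a sufficient objective: without the KL term, the optimizer could suppress refusals by damaging the model wholesale.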

Quick Start & Requirements

  • Installation: pip install heretic-llm
  • Prerequisites: Python 3.10+ environment with PyTorch 2.2+ installed, suitable for your hardware.
  • Usage: Run heretic <model_name_or_path> from the command line (see the example after this list).
  • Hardware: GPU is recommended; benchmarks were conducted on RTX 5090/3090.
  • Resource Footprint: Decensoring Llama-3.1-8B takes approximately 45 minutes on an RTX 3090 with default settings.
  • Links: A collection of models decensored using Heretic is available on Hugging Face.
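Putting the installation and usage steps together, a typical session looks like this (the model identifier is illustrative; any supported Hugging Face model ID or local path should work):

    pip install heretic-llm
    heretic meta-llama/Llama-3.1-8B-Instruct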

Highlighted Details

  • Achieves refusal suppression comparable to manually curated decensored models, but with significantly lower KL divergence, indicating less damage to the original capabilities (e.g., 0.16 KL divergence for Heretic's Gemma-3-12B-it vs. 0.45-1.04 for alternatives; the metric is sketched after this list).
  • The process is fully automatic, requiring no manual configuration for basic use, though extensive parameters are available for advanced control.
  • Supports most dense models and several MoE architectures, including many multimodal variants.
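For context on the KL divergence figures above: the metric quantifies how far the decensored model's next-token distributions drift from the original model's. A sketch of the standard definition, where the direction of the divergence and the averaging procedure are assumptions rather than Heretic's documented setup:

    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t) \log \frac{P(t)}{Q(t)}

Here P is the original model's distribution over the vocabulary V for a given context, Q is the decensored model's, and the value is averaged over a set of evaluation prompts. On this scale, 0.16 indicates substantially less drift from the original model than 0.45-1.04.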

Maintenance & Community

The README does not mention notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: GNU Affero General Public License v3 or later.
  • Compatibility: AGPL-3.0 is a strong copyleft license. Anyone who distributes modified versions, or makes them available to users over a network, must release the corresponding source code under the same license.

Limitations & Caveats

Heretic does not yet support models based on State Space Models (SSMs), hybrid architectures, models with inhomogeneous layers, or certain novel attention systems.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 20
Star History
409 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

prompt-lookup-decoding by apoorvumang

Decoding method for faster LLM generation
Top 0.2% on SourcePulse
584 stars
Created 2 years ago
Updated 1 year ago