heretic by p-e-w

Automatic LLM censorship removal

Created 3 months ago
4,141 stars

Top 11.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Heretic provides a fully automatic solution for removing censorship ("safety alignment") from transformer-based language models. It targets engineers and researchers who want to adapt LLMs for broader applications without expensive post-training, offering a way to decensor models while preserving their original intelligence and capabilities.

How It Works

Heretic combines an advanced implementation of directional ablation ("abliteration") with a TPE-based parameter optimizer from Optuna. The tool automatically identifies optimal ablation parameters by simultaneously minimizing the number of model refusals and the KL divergence from the original model. This co-minimization ensures that the decensored model retains as much of its original intelligence as possible. The process requires no specialized knowledge of transformer internals, making it accessible to anyone comfortable with command-line tools.
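To make the co-minimization concrete, below is a minimal sketch of such an optimization loop using Optuna's TPE sampler. This is not Heretic's actual code: the parameter names, the stub helpers (apply_ablation, count_refusals, kl_from_original), and the way the two objectives are folded into a single score are all illustrative assumptions.

    import optuna

    # Placeholder stand-ins for the real abliteration and evaluation
    # machinery (stubs for illustration only).
    def apply_ablation(weight: float, max_layer: int):
        """Return a copy of the base model with the refusal direction
        ablated using the given parameters."""
        return None  # stub

    def count_refusals(model) -> int:
        """Run refusal-eliciting prompts and count how many are refused."""
        return 0  # stub

    def kl_from_original(model) -> float:
        """Mean KL divergence of next-token distributions vs. the base
        model on harmless prompts."""
        return 0.0  # stub

    def objective(trial: optuna.Trial) -> float:
        # Hypothetical ablation parameters; the real search space differs.
        weight = trial.suggest_float("ablation_weight", 0.0, 2.0)
        max_layer = trial.suggest_int("max_layer", 1, 32)

        model = apply_ablation(weight, max_layer)
        refusals = count_refusals(model)   # lower = fewer refusals
        kl = kl_from_original(model)       # lower = closer to original

        # Fold both goals into one score to minimize; the weighting (and
        # whether Heretic scalarizes at all) is an assumption.
        return refusals + 10.0 * kl

    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(),
    )
    study.optimize(objective, n_trials=100)
    print(study.best_params)

The key design point is that refusal count alone is not a sufficient objective: without the KL term, the optimizer could suppress refusals by damaging the model wholesale.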

Quick Start & Requirements

  • Installation: pip install heretic-llm
  • Prerequisites: Python 3.10+ environment with PyTorch 2.2+ installed, suitable for your hardware.
  • Usage: Run heretic <model_name_or_path> from the command line (see the example after this list).
  • Hardware: GPU is recommended; benchmarks were conducted on RTX 5090/3090.
  • Resource Footprint: Decensoring Llama-3.1-8B takes approximately 45 minutes on an RTX 3090 with default settings.
  • Links: A collection of models decensored using Heretic is available on Hugging Face.
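Putting the installation and usage steps together, a typical session looks like this (the model identifier is illustrative; any supported Hugging Face model ID or local path should work):

    pip install heretic-llm
    heretic meta-llama/Llama-3.1-8B-Instruct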

Highlighted Details

  • Achieves refusal suppression comparable to manually curated decensored models, but with significantly lower KL divergence, indicating less damage to the original capabilities (e.g., 0.16 KL divergence for Heretic's Gemma-3-12B-it vs. 0.45-1.04 for alternatives; the metric is sketched after this list).
  • The process is fully automatic, requiring no manual configuration for basic use, though extensive parameters are available for advanced control.
  • Supports most dense models and several MoE architectures, including many multimodal variants.
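For context on the KL divergence figures above: the metric quantifies how far the decensored model's next-token distributions drift from the original model's. A sketch of the standard definition, where the direction of the divergence and the averaging procedure are assumptions rather than Heretic's documented setup:

    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t) \log \frac{P(t)}{Q(t)}

Here P is the original model's distribution over the vocabulary V for a given context, Q is the decensored model's, and the value is averaged over a set of evaluation prompts. On this scale, 0.16 indicates substantially less drift from the original model than 0.45-1.04.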

Maintenance & Community

The README does not mention notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: GNU Affero General Public License v3 or later.
  • Compatibility: AGPL-3.0 is a strong copyleft license. Anyone who distributes modified versions, or makes them available to users over a network, must release the corresponding source code under the same license.

Limitations & Caveats

Heretic does not yet support models based on State Space Models (SSMs), hybrid architectures, models with inhomogeneous layers, or certain novel attention systems.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 20
Star History
409 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

prompt-lookup-decoding by apoorvumang

Decoding method for faster LLM generation
Top 0.2% on SourcePulse
584 stars
Created 2 years ago
Updated 1 year ago