OBLITERATUS by elder-plinius

Advanced toolkit for liberating LLMs from refusal behaviors

Created 4 days ago

1,945 stars

Top 22.0% on SourcePulse

View on GitHub
Project Summary

OBLITERATUS is an advanced open-source toolkit for understanding and removing refusal behaviors from large language models (LLMs) without retraining. It targets researchers and engineers seeking to liberate models from artificial gatekeeping while preserving core capabilities. The project also serves as a distributed research experiment, crowdsourcing data to advance LLM interpretability.

How It Works

The project implements "abliteration," a technique that identifies and surgically removes the internal representations responsible for content refusal. This involves probing model states, extracting refusal directions via methods like SVD, and intervening at inference time. The approach enables precise model liberation without retraining, preserves general language and reasoning abilities, and contributes to a growing research dataset.
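The core recipe can be sketched in a few lines. The example below is a minimal illustration on synthetic activations: the function names and the difference-of-means estimate are generic abliteration ingredients, not OBLITERATUS's actual API.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a 'refusal direction' from hidden-state activations.

    harmful_acts / harmless_acts: (n_prompts, d_model) arrays of residual-stream
    activations captured at one layer for each prompt set. The mean difference
    is the simplest estimator; an SVD of per-prompt differences generalizes it.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(hidden, direction):
    """Remove the refusal component from hidden states at inference time."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic demo: 'harmful' activations carry an extra refusal feature.
rng = np.random.default_rng(0)
d_model = 64
refuse = rng.normal(size=d_model)
harmless = rng.normal(size=(32, d_model))
harmful = harmless + 3.0 * refuse
direction = refusal_direction(harmful, harmless)
cleaned = ablate(harmful, direction)
# After ablation, activations have (near-)zero projection onto the direction.
print(np.abs(cleaned @ direction).max() < 1e-9)
```

In a real pipeline the activations would be captured with forward hooks on a HuggingFace model rather than generated synthetically.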

Quick Start & Requirements

  • Primary install/run options:
      • HuggingFace Spaces (zero setup)
      • Local Gradio UI: pip install -e ".[spaces]", then obliteratus ui
      • Google Colab
      • CLI: pip install -e ., then obliteratus obliterate ...
      • Python API
  • Prerequisites: Any HuggingFace transformer model. GPU recommended for local execution.
  • Links: HuggingFace Spaces demo available.

Highlighted Details

  • "Abliteration" techniques for precise refusal removal without retraining.
  • 15 analysis modules for deep mapping of refusal mechanisms (e.g., Concept Cone Geometry, Alignment Imprint Detection).
  • Novel techniques like Expert-Granular Abliteration (EGA) and Analysis-Informed Pipeline for auto-configured liberation.
  • Supports both permanent weight projection and reversible steering vectors.
  • Crowd-sourced research platform via opt-in telemetry, contributing to a dataset on refusal universality.
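The two intervention modes listed above (permanent weight projection vs. reversible steering) can be sketched as follows. This is illustrative numpy over a single linear layer, with hypothetical function names, not the project's actual API.

```python
import numpy as np

def project_out_weights(W, direction):
    """Permanent ablation: orthogonalize a weight matrix's output space
    against the refusal direction, so the component can never be written."""
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d) @ W  # (I - d d^T) W

def steering_hook(hidden, direction, alpha=1.0):
    """Reversible steering: subtract the refusal component at inference
    time; removing the hook restores the original model exactly."""
    d = direction / np.linalg.norm(direction)
    return hidden - alpha * (hidden @ d) * d

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
d = rng.normal(size=64)
x = rng.normal(size=64)
dn = d / np.linalg.norm(d)

W_abl = project_out_weights(W, d)
# Both routes zero out the refusal component of the layer's output:
print(abs((W_abl @ x) @ dn))                  # near zero
print(abs(steering_hook(W @ x, d) @ dn))      # near zero
```

The trade-off: weight projection ships a modified checkpoint, while a steering hook leaves the weights untouched and can be toggled per request.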

Maintenance & Community

  • Community-driven research central to the project.
  • Opt-in telemetry and PRs contribute to a shared dataset and leaderboard.
  • Active development indicated by novel techniques and an extensive test suite (837 tests).

Licensing & Compatibility

  • License: dual-licensed under AGPL-3.0 (open source) and a commercial license.
  • Restrictions: AGPL-3.0 requires source disclosure for network services; commercial license available for proprietary SaaS or closed-source products.

Limitations & Caveats

The AGPL-3.0 license's network service clause may require a commercial license for certain deployments. Users opting out of telemetry will not contribute to the shared research dataset. The advanced analytical features may present a steep learning curve.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 6
  • Issues (30d): 10
  • Star history: 1,997 stars in the last 4 days

Explore Similar Projects

Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

4k stars
Tool for language model behavior investigation
Created 2 years ago
Updated 1 year ago