ErisForge by Tsadoq

LLM modification library for controlled behavior alteration

Created 1 year ago

269 stars

Top 95.5% on SourcePulse

Project Summary

Understanding how an LLM works internally, or modifying its behavior for a specific research or application need, is difficult. ErisForge addresses this with a Python library that directly manipulates the internal layers of Large Language Models (LLMs). Researchers and developers can use it to systematically ablate or augment model responses and to create modified model versions for controlled experimentation and analysis, which is particularly useful for studying model safety and behavior.

How It Works

ErisForge enables targeted modifications to LLMs by applying transformations to their internal decoder layers. It offers specialized classes like AblationDecoderLayer and AdditionDecoderLayer to systematically remove or enhance specific functionalities within the model's architecture. The library supports the definition of custom "behavior directions" for precise control over the nature of these alterations. Additionally, it includes an ExpressionRefusalScorer to quantitatively assess the presence of refusal phrases in model outputs, aiding in the analysis of safety-related behaviors.
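As a rough illustration of what removing or adding a "behavior direction" can mean, the sketch below applies the commonly described projection approach to a single hidden-state vector. This is an assumption about the general technique, not ErisForge's actual implementation: the function names (ablate, add_direction) and the toy vectors are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def normalize(direction):
    """Return `direction` scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in direction]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    unit = normalize(direction)
    coeff = dot(hidden, unit)
    return [h - coeff * u for h, u in zip(hidden, unit)]


def add_direction(hidden, direction, scale=1.0):
    """Push `hidden` along `direction` by `scale` units."""
    unit = normalize(direction)
    return [h + scale * u for h, u in zip(hidden, unit)]


# Toy hidden state and behavior direction (hypothetical values).
hidden = [3.0, 4.0, 0.0]
direction = [2.0, 0.0, 0.0]

ablated = ablate(hidden, direction)
steered = add_direction(hidden, direction, scale=2.0)
```

After ablation the vector has no remaining component along the direction, while addition shifts it along the direction by the chosen scale. In a real model the same arithmetic would be applied to the output of each decoder layer.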

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/tsadoq/erisforge.git), navigate to the directory (cd erisforge), and install dependencies (pip install -r requirements.txt). Alternatively, install directly via pip: pip install erisforge.
Prerequisites: Requires Python, torch, and the transformers library. Usage involves loading models and tokenizers, typically from the Hugging Face Hub.
Links: Example usage snippets are provided in the README, with a reference to a notebook for a more comprehensive demonstration of model layer transformation.
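The loading step mentioned under Prerequisites might look like the sketch below. It uses only the standard transformers auto classes. The helper names are hypothetical, and the model.model.layers attribute path is an assumption that holds for Llama-style architectures but not for every model family. None of this is ErisForge's own API.

```python
def load_model_and_tokenizer(model_name):
    """Load a causal LM and its tokenizer from the Hugging Face Hub."""
    # Imported inside the function so this sketch can be read and imported
    # even where transformers is not installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer


def count_decoder_layers(model):
    """Count decoder layers, assuming a Llama-style `model.model.layers` layout."""
    return len(model.model.layers)
```

A caller would pass any causal-LM identifier from the Hub to load_model_and_tokenizer. The decoder layers reached this way are the units a layer-transformation library would wrap or replace.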

Highlighted Details

Directly modifies internal layers of LLMs for altered response behaviors.
Features AblationDecoderLayer and AdditionDecoderLayer for systematic modification.
Includes ExpressionRefusalScorer to measure model refusal expressions.
Supports custom behavior directions for fine-grained control.
Transformed models can be saved locally or pushed to the Hugging Face Hub.
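To make the idea of measuring refusal expressions concrete, here is a minimal phrase-matching sketch. It is not the ExpressionRefusalScorer implementation: the phrase list, the function names, and the sample responses are all hypothetical, and the real scorer may use a different phrase set or scoring rule.

```python
# Hypothetical refusal phrases; a real scorer would likely use a longer list.
REFUSAL_PHRASES = ("i cannot", "i can't", "i'm sorry", "as an ai")


def contains_refusal(text, phrases=REFUSAL_PHRASES):
    """Return True if any refusal phrase occurs in `text` (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def refusal_rate(responses, phrases=REFUSAL_PHRASES):
    """Fraction of responses that contain at least one refusal phrase."""
    if not responses:
        return 0.0
    hits = sum(contains_refusal(r, phrases) for r in responses)
    return hits / len(responses)


# Made-up model outputs for illustration only.
responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a summary of the article.",
    "As an AI, I cannot provide that information.",
    "The capital of France is Paris.",
]
rate = refusal_rate(responses)
```

Comparing this rate on the same prompts before and after a layer transformation gives a simple, if coarse, signal of whether refusal behavior changed.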

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap. Contributions are encouraged through standard open-source practices like forking and submitting pull requests.

Licensing & Compatibility

License: MIT License.
Compatibility: The MIT license generally permits broad use, including commercial applications and integration into closed-source projects, though users should consult the full license text. The library is built upon and compatible with the Hugging Face Transformers ecosystem.

Limitations & Caveats

This library is explicitly provided for research and development purposes only. The author assumes no responsibility for any specific applications or uses of ErisForge. Its functionality depends on the underlying models and on the architectures supported by the Hugging Face Transformers library.

