circuit-breakers by GraySwanAI

Enhancing AI alignment and robustness

Created 1 year ago
253 stars

Top 99.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Summary

This repository presents "circuit breaking," a technique for improving the alignment and robustness of AI systems. Inspired by representation engineering, its core objective is to prevent models, both Large Language Models (LLMs) and multimodal systems, from generating harmful or undesirable content by directly identifying and altering the internal representations that lead to harmful outputs. Circuit breaking offers an alternative to conventional safety mechanisms such as refusal training or adversarial training, aiming for stronger protection against sophisticated, previously unseen adversarial attacks without degrading the model's capabilities or performance.

How It Works

Circuit breaking works by targeting and modifying the internal "circuitry" of an AI model that is responsible for generating harmful content. Drawing on representation engineering, the approach locates problematic representations in the model's latent space and reroutes or neutralizes them, interrupting harmful generation before it surfaces rather than refusing outputs after the fact. Because the intervention acts on representations rather than on specific attack patterns, it is designed to remain robust to strong adversarial perturbations not seen during training, addressing a gap in current AI safety practice without sacrificing model utility.
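
To make the intuition concrete, the sketch below shows one way a representation-level objective of this kind could be written in PyTorch: a "reroute" term pushes the fine-tuned model's hidden states on harmful prompts away from those of a frozen copy of the original model, while a "retain" term keeps hidden states on benign prompts close to the original. This is an illustrative sketch, not the repository's actual training code; the hidden_states helper, the chosen layer indices, the harmful_batch / benign_batch dictionaries, and the alpha weighting are all assumptions.

import torch
import torch.nn.functional as F

def hidden_states(model, input_ids, attention_mask, layers):
    # Run a Hugging Face-style causal LM once and stack the hidden states
    # from the selected transformer layers: [num_layers, batch, seq, hidden].
    out = model(input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True)
    return torch.stack([out.hidden_states[i] for i in layers], dim=0)

def circuit_breaker_style_loss(model, frozen_ref, harmful_batch, benign_batch,
                               layers=(10, 20), alpha=5.0):
    # harmful_batch / benign_batch are assumed to be dicts holding
    # "input_ids" and "attention_mask" tensors.

    # Reroute term: on harmful inputs, drive cosine similarity with the frozen
    # reference representations toward zero (the ReLU means orthogonal or
    # opposed states incur no penalty).
    h_new = hidden_states(model, **harmful_batch, layers=layers)
    with torch.no_grad():
        h_ref = hidden_states(frozen_ref, **harmful_batch, layers=layers)
    reroute = F.relu(F.cosine_similarity(h_new, h_ref, dim=-1)).mean()

    # Retain term: on benign inputs, keep representations close to the
    # reference so general capabilities are preserved.
    b_new = hidden_states(model, **benign_batch, layers=layers)
    with torch.no_grad():
        b_ref = hidden_states(frozen_ref, **benign_batch, layers=layers)
    retain = (b_new - b_ref).norm(dim=-1).mean()

    return reroute + alpha * retain

The balance between the two terms is the crux: too little retention shifts behavior on benign inputs, while too little rerouting leaves harmful representations intact, which is the capability-versus-safety trade-off the project aims to avoid.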

Quick Start & Requirements

The provided README snippet does not include details on installation procedures, specific software dependencies (e.g., Python versions, libraries), hardware requirements (e.g., GPU, CUDA), or estimated setup times.

Highlighted Details

  • Enhanced AI Safety: Directly addresses alignment and robustness concerns in AI models.
  • Adversarial Defense: Provides robust protection against strong, unseen adversarial attacks.
  • Broad Applicability: Designed for both Large Language Models (LLMs) and multimodal AI systems.
  • Capability Preservation: Aims to maintain model performance and utility while implementing safety measures.

Maintenance & Community

Information regarding project maintenance status, community engagement channels (such as Discord or Slack), notable contributors, or a public roadmap is not available in the provided README excerpt.

Licensing & Compatibility

The README snippet does not specify the software license under which this project is distributed, nor does it offer guidance on compatibility for commercial use or integration into closed-source systems.

Limitations & Caveats

No specific limitations, known bugs, alpha/beta status, or unsupported platforms are mentioned within the provided README content.

Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luca Soldaini (Research Scientist at Ai2), and 7 more.

hh-rlhf by anthropics

0%
2k
RLHF dataset for training safe AI assistants
Created 3 years ago
Updated 6 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 4 more.

L1B3RT4S by elder-plinius

0.7%
17k
AI jailbreak prompts
Created 1 year ago
Updated 2 weeks ago