DontFeedTheAI  by zeroc00I

LLM data anonymization proxy for secure penetration testing

Created 2 weeks ago

New!

317 stars

Top 85.3% on SourcePulse

View on GitHub
Project Summary

This project provides a transparent reverse proxy for Claude Code that anonymizes sensitive penetration-testing data (IPs, hashes, credentials, hostnames, and PII) before it is sent to the Anthropic API. It targets security professionals and researchers who use LLMs on client data, helping them maintain privacy and compliance during engagements. The primary benefit is the ability to use powerful AI tools like Claude Code on sensitive pentest data without exposing client-specific information.

How It Works

The system operates as an invisible proxy between Claude Code and the Anthropic API. It intercepts all outgoing data, including bash command outputs, file reads, and grep results, identifying and replacing sensitive information with realistic-looking surrogates. Anonymization uses a dual-layer detection mechanism: a local Ollama LLM (e.g., qwen3:4b) for context-aware entities such as hostnames, usernames, and credentials, plus a deterministic regex safety net for patterns such as IPs, CIDRs, hashes, and API keys. Mappings between original data and surrogates are stored persistently in a per-engagement SQLite vault, ensuring consistency and preventing collisions within a client's scope. Responses from Anthropic are de-anonymized using these mappings before being presented to Claude Code, so the remote model never processes actual client data.
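The flow above can be sketched in a few lines. This is a minimal illustration of the regex safety-net layer plus a consistent surrogate vault, not the project's actual implementation (which backs the vault with SQLite, covers many more entity types, and adds the LLM layer for context-dependent data); all names and the surrogate format here are illustrative.

```python
import re

# Deterministic safety-net pattern for one entity type (IPv4).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

class Vault:
    """Maps originals to stable surrogates so repeats stay consistent."""
    def __init__(self):
        self.fwd = {}   # original -> surrogate
        self.rev = {}   # surrogate -> original

    def surrogate_for(self, original: str) -> str:
        if original not in self.fwd:
            # Realistic-looking replacement from a documentation IP range.
            s = f"203.0.113.{len(self.fwd) + 1}"
            self.fwd[original] = s
            self.rev[s] = original
        return self.fwd[original]

def anonymize(text: str, vault: Vault) -> str:
    # Same original always yields the same surrogate within an engagement.
    return IPV4.sub(lambda m: vault.surrogate_for(m.group(0)), text)

def deanonymize(text: str, vault: Vault) -> str:
    # Reverse the mapping before the response reaches Claude Code.
    for s, original in vault.rev.items():
        text = text.replace(s, original)
    return text

vault = Vault()
out = anonymize("ssh into 10.1.2.3, then scan 10.1.2.3 and 10.9.9.9", vault)
restored = deanonymize(out, vault)
```

Because the vault is keyed per engagement, the same client IP always maps to the same surrogate across the whole session, while a fresh vault for the next client prevents cross-engagement leakage.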

Quick Start & Requirements

  • Primary Install/Run:
    • Option A (VPS): Run the FastAPI proxy and Ollama on a remote VPS, exposing them via an SSH tunnel. Requires Python locally.
    • Option B (Native): For Apple Silicon, use ./scripts/setup.sh, ollama pull qwen3:1.7b, then ./scripts/run.sh for the proxy and claude for the client.
    • Option C (Docker): Use make docker-up for a containerized setup (CPU only).
  • Prerequisites: Python, Ollama (with models like qwen3:1.7b or qwen3:4b), Docker (for Option C).
  • Resource Footprint: Ollama models range from ~1GB (qwen3:1.7b) upwards. Setup involves script execution and model downloads.
  • Links: Detailed setup instructions are provided within the README.

Highlighted Details

  • Dual-layer anonymization combining an Ollama LLM for contextual data and regex for deterministic patterns.
  • Per-engagement PII Vault (SQLite) for consistent surrogate mapping and isolation between clients.
  • Self-improving feedback loop (scripts/auto_improve.py) for continuous enhancement of anonymization coverage, aiming for a 0% leak policy.
  • Comprehensive detection covering IPs, CIDRs, hashes, MACs, emails, domains, cloud tokens, JWTs, hostnames, usernames, passwords, organization names, and more.
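To make the deterministic side of that coverage concrete, here are illustrative regexes for a handful of the listed entity types; the project's actual expressions may differ and cover many more cases.

```python
import re

# Illustrative detection patterns (not the project's actual expressions).
PATTERNS = {
    "ipv4":  r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "cidr":  r"\b(?:\d{1,3}\.){3}\d{1,3}/\d{1,2}\b",
    "mac":   r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "md5":   r"\b[a-f0-9]{32}\b",          # 32 hex chars
    "jwt":   r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b",  # three base64url segments
}

def detect(text: str) -> dict:
    """Return all matches per entity type."""
    return {kind: re.findall(rx, text) for kind, rx in PATTERNS.items()}

hits = detect("admin@corp.local logged in from 192.168.1.10 "
              "(aa:bb:cc:dd:ee:ff), hash 5f4dcc3b5aa765d61d8327deb882cf99")
```

Patterns like these are what make the regex layer deterministic: a well-formed IP or hash is always caught, regardless of context, which is exactly the property the LLM layer cannot guarantee.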

Maintenance & Community

No specific details regarding maintainers, sponsorships, or community channels (like Discord or Slack) are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Consequently, compatibility for commercial use or closed-source linking cannot be determined from the provided documentation.

Limitations & Caveats

  • The regex layer may miss context-dependent data such as bare hostnames or unusual password formats, making the LLM layer essential.
  • Very dense outputs exceeding LLM_CHUNK_SIZE (default 1500 chars) may lose context at chunk boundaries.
  • There is no provable privacy guarantee against metadata or writing-style correlation attacks.
  • A low but non-zero risk of surrogate collision exists if different original data maps to the same surrogate, though the per-engagement vault mitigates this within a session.
  • The tool is not a substitute for reviewing NDAs and contracts governing the use of cloud AI services.
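The chunk-boundary caveat is easy to demonstrate. The sketch below uses a tiny chunk size (the proxy's actual default is 1500 characters, per the LLM_CHUNK_SIZE setting) to show how naive fixed-size chunking can split an entity so that no chunk contains a matchable whole; the variable names are illustrative.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CHUNK_SIZE = 16  # shrunk for illustration; the proxy defaults to 1500

text = "target host: 10.20.30.40 ok"
chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
# Chunks: "target host: 10." and "20.30.40 ok" -- the IP is split in two,
# so per-chunk detection finds nothing, while whole-text detection does.
per_chunk = [IPV4.findall(c) for c in chunks]
whole = IPV4.findall(text)
```

Mitigations in this style of system typically involve overlapping chunks or running the deterministic regex pass over the unchunked stream, which is one reason the dual-layer design matters.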

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
324 stars in the last 15 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michele Castata (President of Replit), and 3 more.

rebuff by protectai

0.3%
1k
SDK for LLM prompt injection detection
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 3 more.

llm-guard by protectai

0.8%
3k
Security toolkit for LLM interactions
Created 2 years ago
Updated 4 months ago