privacy-filter by openai

PII detection and masking for on-premises data sanitization

Created 2 months ago

2,549 stars

Top 17.6% on SourcePulse

4 Experts Love This Project

Joe Walnes, Head of Experimental Projects at Stripe

Elie Bursztein, Cybersecurity Lead at Google DeepMind

Travis Fischer, Founder of Agentic

Zico Kolter, Board Member at OpenAI; ML Professor at CMU

Project Summary

OpenAI Privacy Filter is a model for fast, context-aware, and tunable detection and masking of personally identifiable information (PII) that runs on-premises. It targets teams that need high-throughput data sanitization: because the model runs locally, it reduces reliance on external services and keeps data under the team's control. The primary benefit is efficient, customizable PII handling within existing workflows.

How It Works

The model is a bidirectional token classifier: it was first pretrained autoregressively and then converted into a classifier. It processes each input sequence in a single forward pass and predicts probabilities over 8 privacy label categories. Coherent PII spans are then decoded with a constrained Viterbi procedure, which optimizes the label sequence globally and so gives more stable span boundaries than independent per-token predictions. The architecture prioritizes throughput and context-aware span identification.
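
The decoding step can be illustrated with a minimal sketch of constrained Viterbi over per-token label probabilities. Everything specific here is an assumption made for illustration: the BIO tagging scheme, the two category names, the transition rule, and the toy probabilities. The source does not state the model's actual tag set or constraints, so this is not the project's implementation.

```python
import math

# Hypothetical BIO tag set with two categories. The real model predicts
# 8 privacy categories whose names and tagging scheme are not given here.
TAGS = ["O", "B-NAME", "I-NAME", "B-EMAIL", "I-EMAIL"]


def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X of the same category."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_viterbi(log_probs):
    """Return the highest-scoring tag sequence that obeys `allowed`.

    log_probs: one list per token holding a log-probability for each tag.
    """
    if not log_probs:
        return []
    n_tags = len(TAGS)
    # A sequence may not start inside a span.
    score = [
        log_probs[0][j] if not TAGS[j].startswith("I-") else -math.inf
        for j in range(n_tags)
    ]
    backptrs = []
    for emit in log_probs[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            best_i, best = 0, -math.inf
            for i in range(n_tags):
                if allowed(TAGS[i], TAGS[j]) and score[i] > best:
                    best_i, best = i, score[i]
            new_score.append(best + emit[j])
            ptrs.append(best_i)
        score = new_score
        backptrs.append(ptrs)
    # Trace the best path backwards from the best final tag.
    j = max(range(n_tags), key=lambda k: score[k])
    path = [j]
    for ptrs in reversed(backptrs):
        j = ptrs[j]
        path.append(j)
    return [TAGS[k] for k in reversed(path)]


# Toy probabilities for three tokens (made up for this example).
probs = [
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.40, 0.05, 0.45],
    [0.90, 0.025, 0.025, 0.025, 0.025],
]
log_probs = [[math.log(p) for p in row] for row in probs]

# Independent per-token argmax picks an illegal I-EMAIL after B-NAME.
greedy = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in probs]
decoded = constrained_viterbi(log_probs)
print(greedy)
print(decoded)
```

On this toy input the per-token argmax yields a span that starts as one category and continues as another, while the constrained decoder returns a single consistent span. That is the kind of boundary instability that global decoding is meant to avoid.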

Quick Start & Requirements

Install: pip install -e . (an editable install, run from a local copy of the repository)
Run: use the opf CLI, e.g. opf "text to process". It runs on GPU by default; pass --device cpu to run on CPU.
Prerequisites: a Python environment. If no checkpoint is found locally at ~/.opf/privacy_filter, one is downloaded automatically; a checkpoint can also be specified via --checkpoint.
Features: Handles long contexts up to 128,000 tokens without chunking.
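
For scripted use, the CLI call can be wrapped from Python. The sketch below is hedged: it uses only the command name and the two options mentioned above (--device and --checkpoint). The helper names build_opf_command and run_opf are hypothetical, the placement of options after the text is an assumption, and the output format of opf is not described in the source, so the wrapper returns raw stdout without parsing it.

```python
import shlex
import subprocess


def build_opf_command(text, device=None, checkpoint=None):
    """Build an argv list for the opf CLI (hypothetical helper).

    Only options named in the Quick Start are used; their position
    after the text argument is an assumption.
    """
    cmd = ["opf", text]
    if device is not None:
        cmd += ["--device", device]
    if checkpoint is not None:
        cmd += ["--checkpoint", str(checkpoint)]
    return cmd


def run_opf(text, **options):
    """Run opf and return its raw stdout; requires opf to be installed."""
    cmd = build_opf_command(text, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Show the command that would be executed, without running it.
command = build_opf_command("text to process", device="cpu")
print(shlex.join(command))
```

Passing the arguments as a list rather than a shell string means the input text is never interpreted by a shell, which matters when the text being sanitized comes from untrusted sources.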

Highlighted Details

Permissive Apache 2.0 license, suitable for commercial deployment.
Small footprint: 1.5B total parameters, with only 50M active, enabling browser or laptop execution.
Fine-tunable for adaptation to specific data distributions using efficient methods.
Runtime control over precision/recall trade-offs and detected span lengths.
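
The precision/recall trade-off mentioned in the last item can be illustrated with a generic decision threshold. This is a sketch of the general idea only: the source does not say how Privacy Filter exposes this control. The function names, the per-token probabilities, and the gold labels below are all made up for illustration.

```python
def flag_tokens(pii_probs, threshold=0.5):
    """Flag tokens whose (hypothetical) PII probability meets the threshold."""
    return [p >= threshold for p in pii_probs]


def precision_recall(flags, gold):
    """Compute token-level precision and recall against gold labels."""
    tp = sum(f and g for f, g in zip(flags, gold))
    fp = sum(f and not g for f, g in zip(flags, gold))
    fn = sum(g and not f for f, g in zip(flags, gold))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data: model confidence per token and whether the token is truly PII.
pii_probs = [0.95, 0.60, 0.30, 0.10, 0.55]
gold = [True, True, True, False, False]

for threshold in (0.9, 0.5, 0.2):
    precision, recall = precision_recall(flag_tokens(pii_probs, threshold), gold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold raises recall, while a strict threshold gives the highest precision and the lowest recall. A sanitization pipeline that must not leak PII would typically favor recall and accept some over-masking.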

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels (e.g., Discord/Slack). Development is attributed to OpenAI.

Licensing & Compatibility

The project is licensed under Apache 2.0, a permissive license that is generally compatible with commercial use and closed-source linking. It allows experimentation, customization, and deployment without significant restrictions.

Limitations & Caveats

Privacy Filter is a data minimization aid, not a comprehensive anonymization or compliance guarantee; over-reliance is discouraged. Because the model's label policy is static, organizations with different definitions of PII or specific governance needs must fine-tune it. Performance may degrade on non-English text, non-Latin scripts, or out-of-distribution domains. Potential failure modes include under- or over-detection of PII, fragmented span boundaries, and misclassification of benign strings. High-risk deployments (medical, legal, financial) warrant extreme caution and human review.

Health Check

Last Commit: 2 months ago

Responsiveness: Inactive

Star History: 128 stars in the last 30 days