privacy-filter by openai

PII detection and masking for on-premises data sanitization

Created 1 week ago


1,032 stars

Top 36.1% on SourcePulse

Project Summary

OpenAI Privacy Filter addresses the need for fast, context-aware, and tunable on-premises Personally Identifiable Information (PII) detection and masking. It targets teams requiring high-throughput data sanitization, offering a model that can be run locally, reducing reliance on external services and enhancing data control. The primary benefit is efficient, customizable PII handling within existing workflows.

How It Works

The model employs a bidirectional token-classification approach, initially pretrained autoregressively and then converted into a classifier. It processes input sequences in a single forward pass, predicting probabilities for 8 privacy label categories. Coherent PII spans are decoded using a constrained Viterbi procedure, which optimizes label sequences globally for improved boundary stability over independent token predictions. This architecture prioritizes throughput and context-aware span identification.
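The constrained Viterbi step can be sketched as follows. This is a minimal illustration of the idea, not the model's actual decoder: the BIO-style label set and the transition rule are hypothetical stand-ins for the real 8-category scheme, and show why global decoding yields more stable span boundaries than independent per-token argmax.

```python
import math

# Illustrative BIO label set; the real model predicts 8 privacy categories.
LABELS = ["O", "B-NAME", "I-NAME", "B-EMAIL", "I-EMAIL"]
NEG = float("-inf")

def log(p):
    return math.log(p) if p > 0 else NEG

def allowed(prev, cur):
    # BIO constraint: I-X may only follow B-X or I-X of the same type.
    if cur.startswith("I-"):
        return prev != "O" and prev[2:] == cur[2:]
    return True

def viterbi(probs):
    """probs: per-token dicts mapping label -> probability.
    Returns the best label sequence that satisfies the transition
    constraints, optimized globally rather than token by token."""
    # First token: I-* has no valid predecessor, so it is ruled out.
    score = [{l: (NEG if l.startswith("I-") else log(probs[0][l]))
              for l in LABELS}]
    back = []
    for t in range(1, len(probs)):
        cur, ptr = {}, {}
        for l in LABELS:
            # Best constraint-respecting predecessor for label l.
            prev = max((p for p in LABELS if allowed(p, l)),
                       key=lambda p: score[-1][p])
            cur[l] = score[-1][prev] + log(probs[t][l])
            ptr[l] = prev
        score.append(cur)
        back.append(ptr)
    # Backtrack from the best final label.
    path = [max(LABELS, key=lambda l: score[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For example, a token whose highest-probability label is `I-NAME` at sentence start is still decoded as `O` or `B-NAME`, because the constrained path score makes an orphaned continuation label impossible.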

Quick Start & Requirements

  • Install: pip install -e .
  • Run: Use the opf CLI (e.g., opf "text to process"). Supports GPU (default) and CPU (--device cpu).
  • Prerequisites: Python environment. Checkpoints are automatically downloaded if not found locally (~/.opf/privacy_filter) or can be specified via --checkpoint.
  • Features: Handles long contexts up to 128,000 tokens without chunking.
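To picture the masking stage, a hypothetical post-processing step might replace detected character spans with category placeholders. The `mask_spans` helper and the `[LABEL]` placeholder format below are assumptions for illustration, not the tool's documented output format:

```python
def mask_spans(text, spans):
    """spans: list of (start, end, label) character offsets, non-overlapping.
    Replaces each span with a [LABEL] placeholder, working right to left
    so earlier offsets stay valid as the string changes length."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out
```

For instance, spans covering a name and an email address would yield sanitized text such as `Call [NAME] at [EMAIL]`, preserving the surrounding context for downstream use.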

Highlighted Details

  • Permissive Apache 2.0 license, suitable for commercial deployment.
  • Small footprint: 1.5B total parameters, with only 50M active, enabling browser or laptop execution.
  • Fine-tunable for adaptation to specific data distributions using efficient methods.
  • Runtime control over precision/recall trade-offs and detected span lengths.
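The runtime precision/recall control can be pictured as a probability threshold applied to per-token predictions. This is a hedged sketch of the general technique; the function name, label scheme, and default value here are illustrative, not the project's actual API:

```python
def detect(token_probs, threshold=0.5):
    """token_probs: per-token dicts mapping label -> probability.
    Keeps a token's best non-'O' label only if its probability clears
    the threshold; raising the threshold favors precision (fewer false
    positives), lowering it favors recall (fewer missed PII tokens)."""
    out = []
    for i, probs in enumerate(token_probs):
        label, p = max(((l, q) for l, q in probs.items() if l != "O"),
                       key=lambda kv: kv[1])
        out.append((i, label) if p >= threshold else (i, "O"))
    return out
```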

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels (e.g., Discord/Slack). Development is attributed to OpenAI.

Licensing & Compatibility

The project is licensed under Apache 2.0, a permissive license that is generally compatible with commercial use and closed-source linking; it allows experimentation, customization, and deployment without significant restrictions.

Limitations & Caveats

Privacy Filter is a data minimization aid, not a comprehensive anonymization or compliance guarantee; over-reliance is discouraged. The model's static label policy requires fine-tuning for organizations with differing definitions of PII or specific governance needs. Performance may degrade on non-English text, non-Latin scripts, or out-of-distribution domains. Potential failure modes include under/over-detection of PII, fragmented span boundaries, and misclassification of benign strings. High-risk deployments (medical, legal, financial) warrant extreme caution and human review.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 4
  • Star History: 1,106 stars in the last 7 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pawel Garbacki (cofounder of Fireworks AI), and 4 more.

LongLoRA by JIA-Lab-research

LongLoRA: Efficient fine-tuning for long-context LLMs
3k stars · Created 2 years ago · Updated 1 year ago