openai
PII detection and masking for on-premises data sanitization
Top 36.1% on SourcePulse
OpenAI Privacy Filter addresses the need for fast, context-aware, and tunable on-premises Personally Identifiable Information (PII) detection and masking. It targets teams requiring high-throughput data sanitization, offering a model that can be run locally, reducing reliance on external services and enhancing data control. The primary benefit is efficient, customizable PII handling within existing workflows.
How It Works
The model is a bidirectional token classifier that was first pretrained autoregressively and then converted into a classifier. It processes each input sequence in a single forward pass, predicting probabilities over 8 privacy label categories for every token. Coherent PII spans are then decoded with a constrained Viterbi procedure, which optimizes the label sequence globally and yields more stable span boundaries than independent per-token predictions. This architecture prioritizes throughput and context-aware span identification.
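To make the decoding step concrete, here is a minimal sketch of constrained Viterbi decoding over per-token label probabilities. It is not the project's implementation: the BIO tagging scheme, the two category names (NAME, EMAIL), the floor probability, and every function name are assumptions chosen for illustration, and the real model uses 8 categories. The example shows how a global search repairs a span boundary that independent per-token argmax gets wrong.

```python
import math

# Hypothetical label set for illustration only (the real model has 8 categories).
CATEGORIES = ["NAME", "EMAIL"]
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in ("B", "I")]


def allowed(prev, curr):
    """Assumed BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(log_probs):
    """Return the highest-scoring label sequence that satisfies `allowed`.

    log_probs: one dict per token, mapping every label to a log-probability.
    """
    if not log_probs:
        return []
    # A sequence may not start inside a span, so I-* labels are ruled out at t=0.
    score = {l: (-math.inf if l.startswith("I-") else log_probs[0][l])
             for l in LABELS}
    backptrs = []
    for token_scores in log_probs[1:]:
        new_score, ptr = {}, {}
        for curr in LABELS:
            candidates = [p for p in LABELS if allowed(p, curr)]
            best_prev = max(candidates, key=lambda p: score[p])
            new_score[curr] = score[best_prev] + token_scores[curr]
            ptr[curr] = best_prev
        score = new_score
        backptrs.append(ptr)
    # Backtrack from the best final label.
    path = [max(LABELS, key=lambda l: score[l])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def spans(labels):
    """Collapse a valid BIO sequence into (start, end_exclusive, category) spans."""
    out, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not lab.startswith("I-"):
            out.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return out


def to_log(probs):
    """Fill unspecified labels with a small floor probability, then take logs."""
    return {l: math.log(probs.get(l, 1e-6)) for l in LABELS}


# Made-up probabilities for the four tokens: "call", "Ada", "Lovelace", "now".
probs = [
    {"O": 0.90, "B-NAME": 0.05},
    {"O": 0.10, "B-NAME": 0.40, "I-NAME": 0.50},
    {"O": 0.20, "I-NAME": 0.75},
    {"O": 0.95},
]
log_probs = [to_log(p) for p in probs]

greedy = [max(LABELS, key=lambda l: lp[l]) for lp in log_probs]
decoded = constrained_viterbi(log_probs)

print(greedy)          # ['O', 'I-NAME', 'I-NAME', 'O']  (span starts with I-, invalid)
print(decoded)         # ['O', 'B-NAME', 'I-NAME', 'O']
print(spans(decoded))  # [(1, 3, 'NAME')]
```

Per-token argmax picks I-NAME for the second token even though no span has been opened, while the constrained search pays a small cost to choose B-NAME there and produces one clean span. A hard transition mask like this is only one way to encode constraints; the source does not say which constraints the actual decoder applies.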
Quick Start & Requirements
Install: pip install -e .
Usage: opf CLI (e.g., opf "text to process"). Supports GPU (default) and CPU (--device cpu).
Checkpoint: [...] ~/.opf/privacy_filter) or can be specified via --checkpoint.
Highlighted Details
Maintenance & Community
The provided README does not detail specific contributors, sponsorships, or community channels (e.g., Discord/Slack). Development is attributed to OpenAI.
Licensing & Compatibility
The project is licensed under Apache 2.0, a permissive license that is generally compatible with commercial use and closed-source linking. It allows experimentation, customization, and deployment without significant restrictions.
Limitations & Caveats
Privacy Filter is a data minimization aid, not a comprehensive anonymization or compliance guarantee; over-reliance is discouraged. The model's static label policy requires fine-tuning for organizations with differing definitions of PII or specific governance needs. Performance may degrade on non-English text, non-Latin scripts, or out-of-distribution domains. Potential failure modes include under/over-detection of PII, fragmented span boundaries, and misclassification of benign strings. High-risk deployments (medical, legal, financial) warrant extreme caution and human review.