chiefautism
PII extraction tool for structured data parsing
Top 72.2% on SourcePulse
This project provides a tool for extracting Personally Identifiable Information (PII) from text, serving as the reverse of OpenAI's Privacy Filter. It targets security professionals auditing data for potential leaks and forensic analysts parsing compromised data. The primary benefit is the conversion of masked PII into structured, actionable data spans.
How It Works
The system leverages the same 1.5B parameter model and taxonomy as OpenAI's Privacy Filter but is configured for extraction rather than masking. Its architecture involves an initial pass with the opf 1.5B model to generate BIOES logits, followed by Viterbi decoding to produce character spans. These spans are then refined through a span-merging process (e.g., combining first and last names) and augmented by a regex backstop to catch entities like URLs, secrets, and account numbers that the model might miss. This hybrid approach aims for comprehensive and accurate PII identification.
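The post-model stages described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the per-token scores are made up, the function names (viterbi, to_spans, merge_adjacent, regex_backstop) are hypothetical, the transition rule is the standard BIOES constraint, and the merge rule and URL regex are simplified assumptions.

```python
import re

NEG_INF = float("-inf")

def allowed(prev, cur):
    """Standard BIOES rule: an open entity (B-/I-) must continue with I-/E- of the same type."""
    p_kind, _, p_type = prev.partition("-")
    c_kind, _, c_type = cur.partition("-")
    if p_kind in ("B", "I"):
        return c_kind in ("I", "E") and c_type == p_type
    return c_kind in ("O", "B", "S")

def viterbi(scores, tags):
    """scores[t][k] is the score of tags[k] at token t; returns the best valid tag sequence."""
    n = len(tags)
    # A sequence may only start with O, B- or S-.
    best = [s if tags[k][0] in "OBS" else NEG_INF for k, s in enumerate(scores[0])]
    back = []
    for row in scores[1:]:
        new, ptr = [], []
        for k in range(n):
            cands = [best[j] if allowed(tags[j], tags[k]) else NEG_INF for j in range(n)]
            j = max(range(n), key=cands.__getitem__)
            new.append(cands[j] + row[k])
            ptr.append(j)
        best = new
        back.append(ptr)
    # A sequence may only end with O, E- or S-.
    last = max((k for k in range(n) if tags[k][0] in "OES"), key=best.__getitem__)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [tags[k] for k in reversed(path)]

def to_spans(tags, offsets):
    """Turn a valid BIOES sequence plus token (start, end) offsets into character spans."""
    spans, start = [], None
    for tag, (s, e) in zip(tags, offsets):
        kind, _, label = tag.partition("-")
        if kind in ("B", "S"):
            start = s
        if kind in ("E", "S"):
            spans.append((start, e, label))
    return spans

def merge_adjacent(spans, text):
    """Merge same-label spans separated only by whitespace (e.g. first + last name)."""
    merged = []
    for s, e, label in sorted(spans):
        if merged and merged[-1][2] == label and text[merged[-1][1]:s].strip() == "":
            merged[-1] = (merged[-1][0], max(merged[-1][1], e), label)
        else:
            merged.append((s, e, label))
    return merged

URL_RE = re.compile(r"https?://\S+")  # simplified, illustrative pattern

def regex_backstop(spans, text):
    """Add regex URL hits that do not overlap any model span."""
    extra = [(m.start(), m.end(), "private_url") for m in URL_RE.finditer(text)
             if all(m.end() <= s or m.start() >= e for s, e, _ in spans)]
    return sorted(spans + extra)

# Toy input: whitespace tokens with character offsets and made-up scores.
text = "Contact Jane Doe at https://example.com/jane"
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
TAGS = ["O", "B-private_person", "I-private_person", "E-private_person", "S-private_person"]
scores = [
    [5, 0, 0, 0, 0],    # Contact
    [0, 1, 0, 0, 3],    # Jane
    [0, 0, 0, 2.5, 2],  # Doe: the locally best tag is E-, which is invalid after S-
    [5, 0, 0, 0, 0],    # at
    [2, 0, 0, 0, 0],    # URL token, missed by the (toy) model
]
decoded = viterbi(scores, TAGS)
spans = regex_backstop(merge_adjacent(to_spans(decoded, offsets), text), text)
print(decoded)
print(spans)
```

In this toy input, a greedy per-token argmax would emit an E- tag directly after an S- tag, which is not a valid BIOES sequence; the constrained decode avoids that. The two single-token name spans are then merged into one, and the URL is recovered by the regex pass only.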
Quick Start & Requirements
Install with uv pip install -e ./privacy-filter and uv pip install -e ./pii_parser (both commands require uv). The first run downloads a ~3 GB checkpoint to ~/.opf/privacy_filter/.
Highlighted Details
Entity categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, and secret.
Maintenance & Community
No specific details regarding maintainers, community channels, or roadmap were found in the provided README.
Licensing & Compatibility
The project is licensed under the Apache-2.0 license, which generally permits commercial use and integration into closed-source projects.
Limitations & Caveats
The README does not explicitly detail limitations. However, the model-only parser scores a lower F1 (0.733) than the hybrid approach, and the hybrid parser's ~600 ms CPU latency may be a concern for real-time, high-throughput applications.
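For readers unfamiliar with the metric, here is a small sketch of how an F1 score over extracted spans can be computed. The summary does not say how the reported figure was measured; exact-match span-level scoring is assumed here, and the function name span_f1 and the sample spans are hypothetical.

```python
def span_f1(predicted, gold):
    """Exact-match span F1: a prediction counts only if start, end, and label all match."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Illustrative spans only, not data from the project.
gold = [(8, 16, "private_person"), (20, 44, "private_url"), (50, 62, "private_phone")]
predicted = [(8, 16, "private_person"), (70, 75, "secret")]
score = span_f1(predicted, gold)
print(round(score, 3))
```

Here one of two predictions is correct (precision 0.5) and one of three gold spans is found (recall 1/3), so F1 is the harmonic mean of the two.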