privacy-parser  by chiefautism

PII extraction tool for structured data parsing

Created 1 month ago
398 stars

Top 72.2% on SourcePulse

Project Summary

This project extracts Personally Identifiable Information (PII) from text, serving as the reverse of OpenAI's Privacy Filter: instead of masking PII, it returns each detected item as a structured span. It targets security professionals auditing data for potential leaks and forensic analysts parsing compromised data. The primary benefit is that detected PII comes back as structured, actionable data spans rather than as masked text.

How It Works

The system uses the same 1.5B-parameter model and taxonomy as OpenAI's Privacy Filter but is configured for extraction rather than masking. The opf 1.5B model first produces BIOES logits, which Viterbi decoding turns into character spans. A span-merging step then joins adjacent pieces (e.g., first and last names), and a regex backstop catches entities the model might miss, such as URLs, secrets, and account numbers. This combination of model and regex is the hybrid parser referred to below.
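The decoding step described above can be sketched in Python. This is a minimal illustration, not the project's code: it assumes a single entity type with five tags (O, B, I, E, S), per-token log-scores passed in as plain dicts, and known character offsets for each token. The names viterbi_bioes, tags_to_spans, and score_row are hypothetical. The real model scores BIOES tags for every category in its taxonomy.

```python
import math

TAGS = ["O", "B", "I", "E", "S"]

# Which tag may follow which, for a single entity type (assumption).
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START = {"O", "B", "S"}   # a sequence cannot open inside an entity
END = ("O", "E", "S")     # and it cannot stop in the middle of one


def viterbi_bioes(scores):
    """scores: one dict per token mapping tag -> log-score.
    Returns the highest-scoring tag path that obeys ALLOWED."""
    if not scores:
        return []
    best = [{t: (scores[0][t] if t in START else -math.inf) for t in TAGS}]
    back = []
    for row_scores in scores[1:]:
        row, ptr = {}, {}
        for t in TAGS:
            # Best predecessor among tags that are allowed to precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]),
                       key=lambda p: best[-1][p])
            row[t] = best[-1][prev] + row_scores[t]
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    path = [max(END, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_spans(tags, offsets):
    """Turn a valid BIOES path plus per-token (start, end) offsets into character spans."""
    spans, start = [], None
    for tag, (s, e) in zip(tags, offsets):
        if tag == "S":
            spans.append((s, e))
        elif tag == "B":
            start = s
        elif tag == "E" and start is not None:
            spans.append((start, e))
            start = None
    return spans


def score_row(**high):
    """Toy score row: every tag gets -3.0 unless overridden."""
    row = {t: -3.0 for t in TAGS}
    row.update(high)
    return row


text = "Call Ada Lovelace now"
offsets = [(0, 4), (5, 8), (9, 17), (18, 21)]
# Token 3 slightly prefers I over E, so a per-token argmax would yield the
# invalid sequence O B I O; the constrained path picks E instead.
scores = [score_row(O=-0.1), score_row(B=-0.2),
          score_row(I=-0.5, E=-0.7), score_row(O=-0.1)]
tags = viterbi_bioes(scores)
print(tags)                                               # ['O', 'B', 'E', 'O']
print([text[s:e] for s, e in tags_to_spans(tags, offsets)])  # ['Ada Lovelace']
```

The transition table is the part that matters: it lets the decoder trade a slightly lower local score for a globally valid sequence, which is why the output can be converted to spans without repair heuristics.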

Quick Start & Requirements

  • Installation: Clone the repository, then run uv pip install -e ./privacy-filter followed by uv pip install -e ./pii_parser.
  • Prerequisites: Python environment with uv. The first run downloads a ~3 GB checkpoint to ~/.opf/privacy_filter/.
  • Usage: Python API and CLI examples are provided in the README.
  • Resources: The hybrid parser achieves ~600 ms latency on a CPU.

Highlighted Details

  • Employs the same 1.5B model and label taxonomy as OpenAI's Privacy Filter.
  • The hybrid approach achieves a benchmark F1 score of 0.929.
  • Supports 8 PII categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, and secret.
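As a rough illustration of how span merging and a regex backstop can work over these category labels, here is a small Python sketch. It is not taken from the project: the Span type, the two patterns in BACKSTOP, and the functions merge_adjacent and regex_backstop are all hypothetical, and the merge rule (same label, only whitespace in between) is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


# Hypothetical backstop patterns, not the project's actual regexes.
BACKSTOP = {
    "private_url": re.compile(r"https?://[^\s<>\"']+"),
    "private_email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def merge_adjacent(text, spans):
    """Join same-label spans separated only by whitespace (e.g. first + last name)."""
    merged = []
    for span in sorted(spans, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.label == span.label
                and prev.end <= span.start
                and text[prev.end:span.start].strip() == ""):
            merged[-1] = Span(prev.start, span.end, span.label)
        else:
            merged.append(span)
    return merged


def regex_backstop(text, spans):
    """Add regex matches that do not overlap any span already present."""
    out = list(spans)
    for label, pattern in BACKSTOP.items():
        for m in pattern.finditer(text):
            if all(m.end() <= s.start or m.start() >= s.end for s in out):
                out.append(Span(m.start(), m.end(), label))
    return sorted(out, key=lambda s: s.start)


text = "Ada Lovelace posted https://example.com/reset?id=42 yesterday"
# Pretend the model found the two name tokens separately and missed the URL.
model_spans = [Span(0, 3, "private_person"), Span(4, 12, "private_person")]
final = regex_backstop(text, merge_adjacent(text, model_spans))
for s in final:
    print(s.label, repr(text[s.start:s.end]))
```

Giving model spans priority over regex matches is one possible design choice; a real implementation might instead prefer the longer span or the more specific label when the two overlap.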

Maintenance & Community

No specific details regarding maintainers, community channels, or roadmap were found in the provided README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly detail limitations. However, the model-only parser scores a markedly lower F1 (0.733) than the hybrid approach (0.929), so the regex backstop accounts for much of the reported accuracy. The ~600 ms CPU latency of the hybrid parser may also be a constraint for real-time or high-throughput applications.
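For readers unfamiliar with the metric, the following Python sketch shows one common way to compute a span-level F1 score. The summary does not say how the benchmark matches spans, so exact matching on (start, end, label) is an assumption here, and span_f1 is a hypothetical helper. The toy spans are invented for illustration and are unrelated to the reported 0.733 and 0.929 figures.

```python
def span_f1(predicted, gold):
    """Exact-match F1 over (start, end, label) tuples."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data only: three gold spans, a parser that finds one, and one that finds two.
gold = [(0, 12, "private_person"), (20, 51, "private_url"), (60, 72, "private_phone")]
found_one = [(0, 12, "private_person")]
found_two = [(0, 12, "private_person"), (20, 51, "private_url")]

print(round(span_f1(found_one, gold), 3))  # 0.5
print(round(span_f1(found_two, gold), 3))  # 0.8
```

Both toy parsers have perfect precision, so the gap comes entirely from recall, which is the kind of improvement a backstop for missed entities would be expected to produce.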

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 37 stars in the last 30 days
