privacy-parser by chiefautism

PII extraction tool for structured data parsing

Created 2 months ago

404 stars

Top 71.5% on SourcePulse

Project Summary

This project provides a tool for extracting Personally Identifiable Information (PII) from text, acting as the inverse of OpenAI's Privacy Filter. It targets security professionals auditing data for potential leaks and forensic analysts parsing compromised data. Its main benefit is that PII the filter would mask is instead returned as structured, actionable spans.

How It Works

The system uses the same 1.5B parameter model and taxonomy as OpenAI's Privacy Filter but is configured for extraction rather than masking. The pipeline has four stages: the opf 1.5B model produces BIOES logits; Viterbi decoding turns those logits into character spans; a span-merging step combines adjacent spans that belong together (e.g., a first and last name); and a regex backstop catches entities such as URLs, secrets, and account numbers that the model might miss. Combining the model with the regex backstop is what the summary calls the hybrid approach.
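The decoding stage described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes a single entity type, per-token log-scores in the label order O, B, I, E, S, and token character offsets supplied by the caller. The function names (`viterbi`, `tags_to_spans`) and the example scores are hypothetical.

```python
import math

LABELS = ["O", "B", "I", "E", "S"]  # one entity type, for brevity (assumption)

def allowed(prev: str, cur: str) -> bool:
    """BIOES validity: B or I must be followed by I or E; O, E, S by O, B or S."""
    if prev in ("B", "I"):
        return cur in ("I", "E")
    return cur in ("O", "B", "S")

def viterbi(scores):
    """Return the highest-scoring valid BIOES path.

    scores[t][k] is the log-score of LABELS[k] at token t.
    """
    if not scores:
        return []
    k = len(LABELS)
    neg = -math.inf
    # A sequence cannot start inside an entity (I or E).
    best = [scores[0][j] if LABELS[j] in ("O", "B", "S") else neg for j in range(k)]
    back = []
    for t in range(1, len(scores)):
        new, ptr = [], []
        for j in range(k):
            # Best valid predecessor for label j at this token.
            s, i = max((best[i], i) for i in range(k) if allowed(LABELS[i], LABELS[j]))
            new.append(s + scores[t][j])
            ptr.append(i)
        best = new
        back.append(ptr)
    # A sequence cannot end with an open entity (B or I).
    final = [best[j] if LABELS[j] in ("O", "E", "S") else neg for j in range(k)]
    j = max(range(k), key=lambda x: final[x])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [LABELS[j] for j in reversed(path)]

def tags_to_spans(tags, offsets):
    """Convert BIOES tags plus (start, end) token offsets into character spans."""
    spans, start = [], None
    for tag, (s, e) in zip(tags, offsets):
        if tag == "S":
            spans.append((s, e))
        elif tag == "B":
            start = s
        elif tag == "E" and start is not None:
            spans.append((start, e))
            start = None
    return spans

text = "Call Ada Lovelace now"
offsets = [(0, 4), (5, 8), (9, 17), (18, 21)]
scores = [
    [0.0, -5.0, -5.0, -5.0, -5.0],
    [-3.0, -0.1, -5.0, -5.0, -2.0],
    [-1.0, -5.0, -4.0, -1.2, -5.0],  # greedy argmax would pick O here, leaving B unclosed
    [0.0, -5.0, -5.0, -5.0, -5.0],
]
tags = viterbi(scores)
spans = tags_to_spans(tags, offsets)
```

In this example a per-token argmax would yield the invalid sequence O, B, O, O, whereas the constrained decoder closes the entity and recovers the full name as one span.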

Quick Start & Requirements

Installation: Clone the repository and run uv pip install -e ./privacy-filter and uv pip install -e ./pii_parser.
Prerequisites: Python environment with uv. The first run downloads a ~3 GB checkpoint to ~/.opf/privacy_filter/.
Usage: Python API and CLI examples are provided in the README.
Resources: The hybrid parser achieves ~600 ms latency on a CPU.

Highlighted Details

Employs the same 1.5B model and label taxonomy as OpenAI's Privacy Filter.
The hybrid approach achieves a benchmark F1 score of 0.929.
Supports 8 PII categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, and secret.
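As a rough sketch of how spans in these eight categories might be represented, merged, and supplemented by a regex backstop: the `Span` class, the function names, and both regex patterns below are illustrative assumptions, not the project's actual types or rules.

```python
import re
from dataclasses import dataclass

# The eight categories listed above.
CATEGORIES = {
    "private_person", "private_email", "private_phone", "private_address",
    "private_url", "private_date", "account_number", "secret",
}

@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str

# Illustrative backstop patterns only; a real tool would need far more careful rules.
BACKSTOP = {
    "private_url": re.compile(r"https?://[^\s<>\"']+"),
    "account_number": re.compile(r"\b\d{10,18}\b"),
}

def merge_adjacent(text, spans):
    """Merge same-label spans separated only by whitespace (e.g. first + last name)."""
    merged = []
    for s in sorted(spans, key=lambda x: x.start):
        prev = merged[-1] if merged else None
        if prev and prev.label == s.label and text[prev.end:s.start].strip() == "":
            merged[-1] = Span(prev.start, max(prev.end, s.end), s.label)
        else:
            merged.append(s)
    return merged

def regex_backstop(text, spans):
    """Add regex matches that do not overlap any span already found."""
    out = list(spans)
    for label, pattern in BACKSTOP.items():
        for m in pattern.finditer(text):
            cand = Span(m.start(), m.end(), label)
            if not any(cand.start < s.end and s.start < cand.end for s in out):
                out.append(cand)
    return sorted(out, key=lambda x: x.start)

text = "Ada Lovelace: see https://example.com/a now"
model_spans = [Span(0, 3, "private_person"), Span(4, 12, "private_person")]
result = regex_backstop(text, merge_adjacent(text, model_spans))
found = [(s.label, text[s.start:s.end]) for s in result]
```

Here the two name fragments collapse into one `private_person` span, and the URL, which the hypothetical model output missed, is added by the backstop.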

Maintenance & Community

No specific details regarding maintainers, community channels, or roadmap were found in the provided README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, which generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly detail limitations. However, the model-only parser has a markedly lower F1 score (0.733) than the hybrid approach (0.929), and the hybrid parser's ~600 ms CPU latency may be a constraint for real-time or high-throughput applications.
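For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it over exact-match spans; how the project's benchmark actually matches spans is not stated in this summary, so the exact-match rule and the `span_f1` helper are assumptions for illustration.

```python
def span_f1(predicted, gold):
    """F1 over exact-match (start, end, label) spans."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 12, "private_person"), (18, 39, "private_url"), (45, 55, "private_phone")]
predicted = [(0, 12, "private_person"), (18, 39, "private_url"), (60, 70, "secret")]
score = span_f1(predicted, gold)  # 2 of 3 predictions correct, 2 of 3 gold spans found
```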
