presidio by microsoft

PII de-identification SDK for text and images

Created 7 years ago
7,033 stars

Top 7.3% on SourcePulse

Project Summary

Presidio is an open-source SDK for detecting, redacting, masking, and anonymizing Personally Identifiable Information (PII) across text, images, and structured data. It targets developers and organizations needing to protect sensitive data, offering a customizable and extensible framework for data privacy compliance.

How It Works

Presidio uses a modular architecture with two main components: an Analyzer that detects PII and an Anonymizer that transforms the detected data. The Analyzer takes a hybrid approach, combining Named Entity Recognition (NER) models, regular expressions, rule-based logic, and checksums, and it can also integrate external PII detection models. Combining these signals allows context-aware identification of sensitive entities across multiple languages.
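To make the hybrid idea concrete, here is a minimal standard-library sketch of how a pattern, a checksum, and nearby context words might be combined to score a credit card candidate. It is not Presidio's implementation or API: the names Finding, CARD_PATTERN, CONTEXT_WORDS, and find_credit_cards, the scores 0.5 and 0.9, and the 30-character context window are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """A detected span with a confidence score (hypothetical structure)."""
    entity_type: str
    start: int
    end: int
    score: float


# Pattern step: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# Context step: illustrative cue words that raise confidence when they
# appear shortly before a match.
CONTEXT_WORDS = ("card", "credit", "visa", "mastercard")


def luhn_valid(digits: str) -> bool:
    """Checksum step: return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text: str, window: int = 30) -> List[Finding]:
    """Return Luhn-valid card-like spans, scored higher when context matches."""
    findings = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue  # pattern matched, but the checksum rejects it
        before = text[max(0, m.start() - window):m.start()].lower()
        score = 0.9 if any(w in before for w in CONTEXT_WORDS) else 0.5
        findings.append(Finding("CREDIT_CARD", m.start(), m.end(), score))
    return findings


if __name__ == "__main__":
    sample = "My credit card is 4111 1111 1111 1111, order id 1234567890123456."
    for f in find_credit_cards(sample):
        print(f.entity_type, sample[f.start:f.end], f.score)
```

The checksum filters out digit runs that merely look like card numbers (the order id above fails the Luhn check), while the context step separates a bare number from one introduced by a cue word. An NER model would add a third, independent source of findings.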

Quick Start & Requirements

  • Install: pip install presidio-analyzer presidio-anonymizer
  • Prerequisites: Python 3.7+, Docker for certain deployments.
  • Resources: NER requires downloading a language model (with the default spaCy setup, for example: python -m spacy download en_core_web_lg).
  • Docs, demo, and examples: linked from the project's GitHub page.

Highlighted Details

  • Supports PII detection and redaction in images, including DICOM medical images.
  • Offers Python, PySpark, Docker, and Kubernetes deployment options.
  • Highly customizable PII recognizers and de-identification pipelines.
  • Handles various PII types like credit card numbers, names, and phone numbers.
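A customizable de-identification pipeline can be pictured as a mapping from entity type to a transformation applied to each detected span. The following is a minimal standard-library sketch of that idea, not Presidio's Anonymizer API: the function names, the (entity_type, start, end) span format, and the operator signatures are hypothetical, and spans are assumed not to overlap.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A detected span: (entity_type, start, end). Hypothetical format.
Span = Tuple[str, int, int]
Operator = Callable[[str, str], str]


def replace_with_label(value: str, entity_type: str) -> str:
    """Replace the value with a placeholder such as <PERSON>."""
    return f"<{entity_type}>"


def mask_all_but_last4(value: str, entity_type: str) -> str:
    """Mask every character except the last four."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def sha256_hash(value: str, entity_type: str) -> str:
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def anonymize(
    text: str,
    spans: List[Span],
    operators: Dict[str, Operator],
    default: Operator = replace_with_label,
) -> str:
    """Apply the operator configured for each entity type to its spans."""
    out = text
    # Work right to left so earlier offsets stay valid after each edit.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        op = operators.get(entity_type, default)
        out = out[:start] + op(out[start:end], entity_type) + out[end:]
    return out


if __name__ == "__main__":
    text = "Call Jane Doe at 212-555-0100"
    spans = [("PERSON", 5, 13), ("PHONE_NUMBER", 17, 29)]
    print(anonymize(text, spans, {"PHONE_NUMBER": mask_all_but_last4}))
```

Keeping detection (which spans) separate from transformation (what to do with each span) is what lets the same findings feed different policies, such as masking phone numbers in one deployment and hashing them in another.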

Maintenance & Community

  • Actively maintained by Microsoft.
  • Community discussions via GitHub discussions.
  • Bug reports and suggestions via GitHub issues.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

Presidio automates PII detection but does not guarantee that it will find all sensitive information. Additional systems and safeguards are needed for comprehensive data protection.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 40
  • Issues (30d): 5
  • Star history: 306 stars in the last 30 days
