DataInfra-RedactionEverything by TracyWang95

Local-first data redaction for documents and images

Created 5 months ago

1,146 stars

Top 33.0% on SourcePulse

Project Summary

RedactionEverything provides local-first, privacy-preserving redaction of sensitive information in unstructured data: documents, scanned PDFs, images, and plain text. It targets engineers, researchers, and power users who need to anonymize data without calling external APIs, and it offers a workbench for finding, reviewing, and redacting sensitive content. Because all processing stays within a local or intranet environment, sensitive data never has to leave the machine or network, which supports security and compliance requirements.

How It Works

The system employs a local-first architecture, splitting processing into distinct text and vision pipelines. Textual data is handled by HaS Text semantic NER, with regex as a fallback. The vision pipeline integrates OCR for text extraction from images and scanned documents, HaS Image YOLO for detecting visual privacy regions (faces, seals, etc.), and a VLM for more nuanced visual-semantic tasks like signature detection. It supports configurable schemas, including general, legal, finance, and healthcare presets, enabling domain-specific redaction. The workflow encompasses recognition, human review, redaction, and export, providing a complete anonymization solution.
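
The "semantic NER with regex as a fallback" idea in the text pipeline can be illustrated with a minimal Python sketch. This is not the project's code: the `Span` type, the `detect` and `redact` helpers, and the two regex patterns are hypothetical, and the sketch assumes that "fallback" means regex is used when no NER model is available or the model call fails.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Span:
    """A detected sensitive region: [start, end) offsets plus an entity label."""
    start: int
    end: int
    label: str


# Hypothetical fallback patterns; a real schema preset would define its own.
FALLBACK_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_spans(text: str) -> list[Span]:
    """Find spans using only the regex fallback patterns."""
    spans = [
        Span(m.start(), m.end(), label)
        for label, pattern in FALLBACK_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)


def detect(text: str, ner: Optional[Callable[[str], list[Span]]] = None) -> list[Span]:
    """Prefer the NER callable; fall back to regex if it is missing or raises."""
    if ner is not None:
        try:
            return ner(text)
        except Exception:
            pass  # assumption: any model failure triggers the regex fallback
    return regex_spans(text)


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with a [LABEL] placeholder (assumes non-overlapping spans)."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


sample = "Contact alice@example.com or +1 555-123-4567."
print(redact(sample, detect(sample)))  # Contact [EMAIL] or [PHONE].
```

In the described workflow a human review step sits between detection and redaction, so a real implementation would presumably let a reviewer accept or reject each span before `redact` is applied.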

Quick Start & Requirements

Primary install / run command: npm run dev (Windows + WSL).
Non-default prerequisites and dependencies: Node.js 24 LTS; Python 3.11; an NVIDIA GPU (16 GB VRAM recommended for the full vision pipeline); a CUDA version compatible with the model runtimes.
Setup: Local path configuration is required for model weights and data.
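
Before running the start command, it can help to confirm the prerequisites listed above. The following Python sketch is a hypothetical helper, not part of the project. It relies on `node --version` and on `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, and it only reports what it finds rather than enforcing the 16 GB recommendation.

```python
import shutil
import subprocess
import sys


def parse_node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v24.1.0'."""
    return int(version_output.strip().lstrip("v").split(".")[0])


def python_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter is Python 3.11, the version the summary lists."""
    return tuple(version_info[:2]) == (3, 11)


def node_major():
    """Return the installed Node.js major version, or None if node is not on PATH."""
    if shutil.which("node") is None:
        return None
    result = subprocess.run(["node", "--version"], capture_output=True, text=True, check=False)
    return parse_node_major(result.stdout) if result.returncode == 0 else None


def gpu_memory_mib() -> list[int]:
    """Return total memory per NVIDIA GPU in MiB; empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return [int(line.strip()) for line in result.stdout.splitlines() if line.strip().isdigit()]


if __name__ == "__main__":
    print("Python 3.11:", python_ok())
    print("Node.js major version:", node_major())
    for index, mib in enumerate(gpu_memory_mib()):
        print(f"GPU {index}: {mib / 1024:.1f} GiB total memory")
```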

Highlighted Details

Supports a wide range of file types: TXT, DOCX, PDF, scanned PDF, PNG, JPG.
Integrates semantic NER, OCR, YOLO-based object detection, and VLM for comprehensive sensitive data identification.
Offers configurable schemas for general, legal, finance, and healthcare use cases.
Designed for local or intranet deployment, ensuring data privacy.
Includes batch processing, task management, and review workflows.

Maintenance & Community

The project includes CI via GitHub Actions and welcomes pull requests. No specific community channels (e.g., Discord, Slack) or roadmap links are detailed in the README.

Licensing & Compatibility

The project is released under a custom "Personal Use License." This license permits free use for individuals for personal, non-commercial purposes. Commercial use, including by companies, institutions, or for production deployments, requires a separate commercial license.

Limitations & Caveats

The full vision pipeline, particularly VLM-based signature detection, is recommended to run with 16 GB of VRAM; systems with less may see degraded performance. The VLM complements YOLO rather than replacing it. The project favors models that are practical to deploy locally over the largest available ones.
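
One way to picture the VLM as a complementary layer is to treat its detections as additions to the YOLO results, dropping any VLM box that duplicates an existing box with the same label. The sketch below is an assumption about how such a merge could work, not the project's method; the `Box` type, the `merge_detections` helper, and the 0.5 IoU threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned detection box with a label and the detector that produced it."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    source: str


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(yolo: list[Box], vlm: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep every YOLO box; add VLM boxes that do not duplicate a same-label box."""
    merged = list(yolo)
    for candidate in vlm:
        duplicate = any(
            kept.label == candidate.label and iou(kept, candidate) >= threshold
            for kept in merged
        )
        if not duplicate:
            merged.append(candidate)
    return merged


yolo_boxes = [Box(0, 0, 10, 10, "face", "yolo")]
vlm_boxes = [Box(1, 1, 10, 10, "face", "vlm"), Box(50, 50, 80, 60, "signature", "vlm")]
print([(b.label, b.source) for b in merge_detections(yolo_boxes, vlm_boxes)])
# [('face', 'yolo'), ('signature', 'vlm')]
```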

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

457 stars in the last 30 days