Rust SDK for fast unstructured data extraction
Extractous provides a high-performance, low-memory solution for extracting text and metadata from a wide array of unstructured document formats, targeting developers who need efficient local data processing without relying on external APIs. It aims to significantly outperform existing Python-based libraries such as unstructured-io by leveraging Rust for its core engine.
How It Works
The core of Extractous is written in Rust, using its performance, memory safety, and multi-threading capabilities to bypass Python's Global Interpreter Lock (GIL) limitations. For formats not natively supported by the Rust core, it integrates Apache Tika via native shared libraries compiled with GraalVM. This approach ensures pure native execution without external servers or garbage collection. OCR capabilities are integrated using tesseract-ocr.
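As a rough illustration of the threading point above, the following sketch fans several files out to a thread pool. It assumes the `Extractor` class and its `extract_file_to_string` method as shown in the project's README; the return shape is treated as either plain text or a `(text, metadata)` pair, since that may differ between releases. The helper names `extract_with_extractous` and `extract_many` are hypothetical and not part of the library, and any speedup from threads depends on the native core actually releasing the GIL during extraction.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_with_extractous(path: str) -> str:
    """Extract text from one file (hypothetical helper, not part of extractous)."""
    # Imported lazily so the rest of this module works without the package.
    from extractous import Extractor  # requires `pip install extractous`

    result = Extractor().extract_file_to_string(path)
    # Assumption: some releases return (text, metadata), others only text.
    return result[0] if isinstance(result, tuple) else result


def extract_many(
    paths: Iterable[str],
    extract_fn: Callable[[str], str] = extract_with_extractous,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run extract_fn over paths in a thread pool and map each path to its text."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its text.
        texts = list(pool.map(extract_fn, path_list))
    return dict(zip(path_list, texts))
```

Calling `extract_many(["report.pdf", "notes.docx"])` would use the library for each file; passing a different `extract_fn` makes the fan-out logic easy to exercise without any documents on disk.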
Quick Start & Requirements
pip install extractous
OCR requires Tesseract and the relevant language packs (e.g., sudo apt install tesseract-ocr tesseract-ocr-deu).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats