extractous by yobix-ai

Rust SDK for fast unstructured data extraction

created 1 year ago
1,187 stars

Top 33.5% on sourcepulse

Project Summary

Extractous provides a high-performance, low-memory solution for extracting text and metadata from a wide array of unstructured document formats, targeting developers who need efficient local data processing without relying on external APIs. It aims to significantly outperform existing Python-based libraries like unstructured-io by leveraging Rust for its core engine.

How It Works

The core of Extractous is written in Rust, which gives it native performance, memory safety, and true multi-threading without the constraints of Python's Global Interpreter Lock (GIL). For formats the Rust core does not support natively, it integrates Apache Tika through native shared libraries compiled ahead of time with GraalVM. According to the project, this yields pure native execution with no external servers or garbage collection. OCR is provided through tesseract-ocr.
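To make the multi-threading point concrete, here is a minimal sketch of fanning extraction work out across OS threads using only the Rust standard library. It does not use the Extractous API: `extract_text` and `extract_all` are hypothetical stand-ins invented for illustration, and the "extraction" is just whitespace normalization.

```rust
use std::thread;

/// Hypothetical stand-in for a native extraction call (NOT the Extractous API).
/// It collapses runs of whitespace to mimic cleaned-up extracted text.
fn extract_text(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical helper: process each document on its own OS thread and
/// return the results in the same order as the input.
fn extract_all(docs: Vec<String>) -> Vec<String> {
    // Spawn one worker per document; each worker takes ownership of its input.
    let handles: Vec<_> = docs
        .into_iter()
        .map(|doc| thread::spawn(move || extract_text(&doc)))
        .collect();

    // Joining in spawn order preserves the input order.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let docs = vec![
        "  Quarterly   report \n 2024 ".to_string(),
        "Invoice\t#42".to_string(),
    ];
    for text in extract_all(docs) {
        println!("{text}");
    }
}
```

Because each worker is a plain OS thread with no interpreter lock to contend for, CPU-bound work like this can run on multiple cores at once. A real workload would likely cap the number of threads with a pool rather than spawning one per document.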

Quick Start & Requirements

Highlighted Details

  • Claims to be up to 25x faster and use 11x less memory than unstructured-io.
  • Supports a broad range of formats including Office documents, PDFs, HTML, E-mails, and images (with OCR).
  • Offers bindings for Python, with plans for JavaScript/TypeScript.
  • Includes OCR support for extracting text from images and scanned documents.

Maintenance & Community

  • Actively maintained, with contributions welcomed via issues or pull requests.
  • No specific community links (Discord/Slack) or notable contributors/sponsorships are mentioned in the README.

Licensing & Compatibility

  • Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

  • Currently, only Python bindings are available, though JavaScript/TypeScript bindings are planned.
  • The README mentions extensive format support via Apache Tika integration, but the specific Tika version and its native compilation details are not elaborated upon.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 113 stars in the last 90 days
