extractous by yobix-ai

Rust SDK for fast unstructured data extraction

Created 1 year ago
1,260 stars

Top 31.4% on SourcePulse

Project Summary

Extractous provides a high-performance, low-memory solution for extracting text and metadata from a wide array of unstructured document formats, targeting developers who need efficient local data processing without relying on external APIs. It aims to significantly outperform existing Python-based libraries like unstructured-io by leveraging Rust for its core engine.

How It Works

The core of Extractous is written in Rust, utilizing its performance, memory safety, and multi-threading capabilities to bypass Python's Global Interpreter Lock (GIL) limitations. For formats not natively supported by the Rust core, it integrates Apache Tika via native shared libraries compiled with GraalVM. This approach ensures pure native execution without external servers or garbage collection. OCR capabilities are integrated using tesseract-ocr.
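
The multi-threading point can be illustrated with a minimal Rust sketch. It is not Extractous code: `extract_text` is a hypothetical stand-in that only normalizes whitespace, and `extract_all` is an illustrative helper name. The sketch shows only that Rust threads run extraction work in parallel on separate OS threads with no interpreter lock involved.

```rust
use std::thread;

// Hypothetical stand-in for a real extraction call. It only collapses
// whitespace, so the sketch runs without Extractous, Tika, or Tesseract.
fn extract_text(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

// Extract every document on its own OS thread and return the results
// in the same order as the input slice.
fn extract_all(docs: &[&str]) -> Vec<String> {
    thread::scope(|scope| {
        // Scoped threads may borrow `docs` because they are all joined
        // before `thread::scope` returns.
        let handles: Vec<_> = docs
            .iter()
            .map(|doc| scope.spawn(move || extract_text(doc)))
            .collect();

        // Joining in spawn order keeps output order equal to input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("extraction thread panicked"))
            .collect()
    })
}
```

Calling `extract_all(&["first  doc", "second\ndoc"])` from a `main` function would hand each input to its own thread. A real batch job would more likely cap the thread count with a pool instead of spawning one thread per document.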

Quick Start & Requirements

Highlighted Details

  • Claims to be up to 25x faster and use 11x less memory than unstructured-io.
  • Supports a broad range of formats, including Office documents, PDFs, HTML, emails, and images (with OCR).
  • Offers bindings for Python, with plans for JavaScript/TypeScript.
  • Includes OCR support for extracting text from images and scanned documents.

Maintenance & Community

  • Actively maintained, with contributions welcomed via issues or pull requests.
  • No specific community links (Discord/Slack) or notable contributors/sponsorships are mentioned in the README.

Licensing & Compatibility

  • Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

  • Currently, only Python bindings are available, though JavaScript/TypeScript bindings are planned.
  • The README attributes its extensive format support to Apache Tika integration, but does not specify which Tika version is used or how it is compiled to native code.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 3

Star History

  • 121 stars in the last 30 days
