unstructured by Unstructured-IO

ETL solution for structuring unstructured data for language models

Created 3 years ago
12,687 stars

Top 3.9% on SourcePulse

Project Summary

Unstructured.io provides an open-source Python library for transforming complex documents (PDFs, HTML, DOCX, etc.) into structured data, primarily for use with Large Language Models (LLMs). It offers modular functions and connectors to simplify data ingestion and pre-processing, making it suitable for data engineers and ML practitioners working with diverse document formats.

How It Works

The library uses a modular, connector-based architecture to ingest and pre-process documents. It relies on external tools such as tesseract-ocr for image-based text extraction and pandoc for document format conversion. The partition function acts as a router: it detects the file type and dispatches to the matching format-specific parsing logic.
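The routing idea can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: the names route, PARSERS, parse_text, and parse_html are hypothetical, and file types are guessed from the extension via mimetypes rather than from file contents.

```python
import mimetypes
from html.parser import HTMLParser
from typing import Callable, Dict, List


def parse_text(path: str) -> List[str]:
    """Hypothetical plain-text parser: one element per blank-line-separated block."""
    with open(path, encoding="utf-8") as fh:
        return [block.strip() for block in fh.read().split("\n\n") if block.strip()]


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse_html(path: str) -> List[str]:
    """Hypothetical HTML parser: one element per text node."""
    collector = _TextCollector()
    with open(path, encoding="utf-8") as fh:
        collector.feed(fh.read())
    return collector.chunks


# Registry mapping a detected MIME type to its parser (assumed layout).
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "text/plain": parse_text,
    "text/html": parse_html,
}


def route(path: str) -> List[str]:
    """Detect the file type and dispatch to the matching parser."""
    mime, _ = mimetypes.guess_type(path)  # extension-based guess, a simplification
    parser = PARSERS.get(mime)
    if parser is None:
        raise ValueError(f"no parser registered for {path!r} (detected type: {mime})")
    return parser(path)
```

With the library itself installed, the corresponding entry point is the partition function in unstructured.partition.auto, called as partition(filename=...).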

Quick Start & Requirements

  • Install: pip install "unstructured[all-docs]" for full functionality.
  • Prerequisites: System dependencies include libmagic-dev, poppler-utils, tesseract-ocr (with language packs), libreoffice, and pandoc (v2.14.2+ for RTF). Docker images are available.
  • Resources: Local inference for images/PDFs may require tesseract and poppler.
  • Docs: https://docs.unstructured.io/
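Before installing, it may help to check which of the system tools listed above are already on the PATH. The sketch below is a convenience script, not part of the library. The executable names (tesseract, pdftoppm, soffice, pandoc) are assumptions about what each package typically installs and may differ by platform. libmagic is a shared library rather than an executable, so it is not covered.

```python
import shutil
from typing import Dict, Optional

# Package name -> executable it is assumed to provide (may vary by platform).
TOOLS: Dict[str, str] = {
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_tools(tools: Optional[Dict[str, str]] = None) -> Dict[str, bool]:
    """Report, per package, whether its executable is found on the PATH."""
    tools = TOOLS if tools is None else tools
    return {
        package: shutil.which(executable) is not None
        for package, executable in tools.items()
    }


if __name__ == "__main__":
    for package, found in check_tools().items():
        print(f"{package}: {'found' if found else 'missing'}")
```

A missing entry only means the executable was not found under the assumed name; the tool may still be installed elsewhere.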

Highlighted Details

  • Supports a wide array of document types including PDFs, HTML, DOCX, EML, TXT, and more.
  • Offers a Serverless API for enhanced performance and production workflows.
  • Provides Docker images for simplified deployment and cross-platform compatibility (x86_64, Apple Silicon).
  • Includes optional pre-commit hooks for local development contributions.

Maintenance & Community

The project has a healthy contributor count and actively maintains its releases. Community engagement channels are available via Discord/Slack.

Licensing & Compatibility

The library is released under the Apache-2.0 license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Installation of system dependencies can be complex, particularly on Windows. Performance for complex documents or large batches may necessitate the use of the paid Serverless API.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 14
  • Issues (30d): 4
  • Star History: 313 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

0.9%
3k
Data processing library for large-scale text data
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Paul Copplestone (Cofounder of Supabase), and 4 more.

MegaParse by QuivrHQ

0.1%
7k
File parser optimized for LLM ingestion
Created 1 year ago
Updated 6 months ago