unstructured by Unstructured-IO

ETL solution for structuring unstructured data for language models

Created 3 years ago

13,580 stars

Top 3.6% on SourcePulse


17 Experts Love This Project

Zack Li

Cofounder of Nexa AI

Will Brown

Research Lead at Prime Intellect

Pawel Garbacki

Cofounder of Fireworks AI

Han Wang

Cofounder of Mintlify

and 13 more!

Project Summary

Unstructured.io provides an open-source Python library for transforming complex documents (PDFs, HTML, DOCX, etc.) into structured data, primarily for use with Large Language Models (LLMs). It offers modular functions and connectors to simplify data ingestion and pre-processing, making it suitable for data engineers and ML practitioners working with diverse document formats.

How It Works

The library uses a modular, connector-based architecture to ingest and pre-process documents. It relies on external tools such as tesseract-ocr for extracting text from images and pandoc for document format conversion. The partition function acts as a router: it detects the file type and dispatches the document to the matching format-specific parser, so callers do not have to choose a parser themselves.
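The routing idea can be illustrated with a small standard-library sketch. This is not the library's implementation: the handler functions, the ROUTES table, and the route function are hypothetical names, and the sketch keys on the file extension only, whereas real file type detection would normally inspect the content as well.

```python
from pathlib import Path


# Hypothetical stand-ins for format-specific parsers. Each one just
# reports which parser was chosen and for which file.
def parse_pdf(path):
    return ("pdf", path)


def parse_html(path):
    return ("html", path)


def parse_text(path):
    return ("text", path)


# Extension -> parser table (an assumption made for this sketch).
ROUTES = {
    ".pdf": parse_pdf,
    ".html": parse_html,
    ".htm": parse_html,
    ".txt": parse_text,
}


def route(path):
    """Pick a parser from the file extension and call it."""
    suffix = Path(path).suffix.lower()
    handler = ROUTES.get(suffix)
    if handler is None:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return handler(path)
```

Calling route("report.pdf") selects the PDF handler, and an unknown extension raises ValueError rather than silently falling back to a default parser.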

Quick Start & Requirements

Install: pip install "unstructured[all-docs]" for full functionality.
Prerequisites: System dependencies include libmagic-dev (file type detection), poppler-utils (images and PDFs), tesseract-ocr (images and PDFs, with language packs for additional languages), libreoffice (Microsoft Office documents), and pandoc (v2.14.2+ for RTF). Docker images are available.
Resources: Processing images and PDFs locally needs tesseract and poppler installed.
Docs: https://docs.unstructured.io/
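
After installation, a first run might look like the sketch below. It assumes the package is installed as described above and uses the partition entry point from unstructured.partition.auto; partition_file and category_counts are hypothetical helper names introduced here, and the file path in the usage comment is a placeholder.

```python
from collections import Counter


def partition_file(path):
    """Partition one document into a list of elements."""
    # Imported lazily so the helper below still loads on a machine
    # where unstructured is not installed.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def category_counts(elements):
    """Tally elements by category, falling back to the class name."""
    return Counter(getattr(el, "category", type(el).__name__) for el in elements)


# Usage (placeholder path; requires unstructured and its system dependencies):
#   elements = partition_file("example.pdf")
#   print(category_counts(elements))
#   for el in elements[:5]:
#       print(str(el))
```

The tally gives a quick sense of how a document was split, for example how many titles versus narrative text blocks were found, before the elements are passed on to chunking or embedding.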

Highlighted Details

Supports a wide array of document types including PDFs, HTML, DOCX, EML, TXT, and more.
Offers a Serverless API for enhanced performance and production workflows.
Provides Docker images for simplified deployment and cross-platform compatibility (x86_64, Apple Silicon).
Includes optional pre-commit hooks for local development contributions.

Maintenance & Community

The project has a healthy contributor count and actively maintains its releases. Community engagement channels are available via Discord/Slack.

Licensing & Compatibility

The library is released under the Apache 2.0 license, which permits commercial use and integration with closed-source applications.

Limitations & Caveats

Installation of system dependencies can be complex, particularly on Windows. Performance for complex documents or large batches may necessitate the use of the paid Serverless API.
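
Because missing system tools are a common source of installation trouble, a quick PATH check can help before the first run. The following is a minimal sketch using only the standard library. The executable names are assumptions about a typical install (for example, pdftoppm from poppler-utils and soffice from libreoffice), and libmagic is not covered because it is a shared library rather than an executable.

```python
import shutil

# Executable name -> package from the prerequisites list that usually
# provides it. These names are assumptions and may differ per platform.
TOOLS = {
    "tesseract": "tesseract-ocr",
    "pdftoppm": "poppler-utils",
    "soffice": "libreoffice",
    "pandoc": "pandoc",
}


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return 'executable (package)' strings for tools not found on PATH."""
    return [f"{exe} ({pkg})" for exe, pkg in tools.items() if which(exe) is None]
```

Calling missing_tools() with no arguments checks the current machine. An empty list means every listed executable was found, although it does not confirm versions such as the pandoc minimum for RTF.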

Health Check

Last Commit

1 day ago

Responsiveness

1 week

Star History

215 stars in the last 30 days