unstructured  by Unstructured-IO

ETL solution for structuring unstructured data for language models

created 2 years ago
12,149 stars

Top 4.1% on sourcepulse

Project Summary

Unstructured.io provides an open-source Python library for transforming complex documents (PDFs, HTML, DOCX, etc.) into structured data, primarily for use with Large Language Models (LLMs). It offers modular functions and connectors to simplify data ingestion and pre-processing, making it suitable for data engineers and ML practitioners working with diverse document formats.

How It Works

The library uses a modular, connector-based architecture to ingest and pre-process various document types. It relies on external tools such as tesseract-ocr for image-based text extraction and pandoc for document format conversion. The partition function acts as a router: it detects the type of each input file and dispatches it to the matching format-specific parsing logic, so callers do not need to choose a parser per format.
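
The routing behaviour described above can be sketched as follows. This is a minimal illustration, not the project's recommended workflow: it assumes the library is installed, that `partition` is importable from `unstructured.partition.auto`, and that each returned element exposes a `category` attribute. The helper names `load_elements` and `count_categories` are hypothetical and exist only for this example.

```python
from collections import Counter
from typing import Any, Iterable, List


def load_elements(path: str) -> List[Any]:
    """Partition one file, letting the library detect its type."""
    # Imported lazily so count_categories() stays usable without the library.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def count_categories(elements: Iterable[Any]) -> Counter:
    """Count elements by their `category` attribute.

    Objects without a `category` attribute are counted as "Unknown"
    (an assumption made for this sketch, not library behaviour).
    """
    return Counter(getattr(el, "category", "Unknown") for el in elements)
```

Calling `count_categories(load_elements("report.pdf"))` on a file of your own (the filename is a placeholder) gives a quick view of what the router produced, for example how many title and narrative-text elements were found.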

Quick Start & Requirements

  • Install: pip install "unstructured[all-docs]" for full functionality.
  • Prerequisites: System dependencies include libmagic-dev, poppler-utils, tesseract-ocr (with language packs), libreoffice, and pandoc (v2.14.2+ for RTF). Docker images are available.
  • Resources: Local inference for images/PDFs may require tesseract and poppler.
  • Docs: https://docs.unstructured.io/
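
Because several prerequisites are system packages rather than Python packages, it can help to check for them before partitioning. The sketch below uses only the standard library. The executable names (`tesseract`, `pdftoppm`, `soffice`, `pandoc`) are assumptions about how these packages are typically installed and may differ on your platform; libmagic is a shared library rather than an executable, so it is not covered here.

```python
import re
import shutil
from typing import Dict, Optional, Tuple

# Package name -> executable assumed to be provided by that package.
TOOLS: Dict[str, str] = {
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_tools(tools: Dict[str, str] = TOOLS) -> Dict[str, Optional[str]]:
    """Map each package to the resolved executable path, or None if not on PATH."""
    return {pkg: shutil.which(exe) for pkg, exe in tools.items()}


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first dotted version number from text such as `pandoc --version` output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    return tuple(int(part) for part in match.group(1).split(".")) if match else None


def pandoc_supports_rtf(version_output: str, minimum: Tuple[int, ...] = (2, 14, 2)) -> bool:
    """True if the reported pandoc version meets the 2.14.2 minimum noted for RTF."""
    version = parse_version(version_output)
    return version is not None and version >= minimum
```

`find_tools()` returns `None` for anything missing from `PATH`. To check the RTF requirement, pass the first line printed by `pandoc --version` to `pandoc_supports_rtf`.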

Highlighted Details

  • Supports a wide array of document types including PDFs, HTML, DOCX, EML, TXT, and more.
  • Offers a Serverless API for enhanced performance and production workflows.
  • Provides Docker images for simplified deployment and cross-platform compatibility (x86_64, Apple Silicon).
  • Includes optional pre-commit hooks for local development contributions.

Maintenance & Community

The project has a healthy contributor count and actively maintains its releases. Community engagement channels are available via Discord/Slack.

Licensing & Compatibility

The library is released under the Apache 2.0 license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Installing the system dependencies can be complex, particularly on Windows. For complex documents or large batches, local processing may be too slow, in which case the paid Serverless API may be needed.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 22
  • Issues (30d): 6
  • Star History: 1,205 stars in the last 90 days
