docetl by ucbepic

Agentic LLM-powered system for data processing and ETL

created 1 year ago
2,444 stars

Top 19.4% on sourcepulse

Project Summary

DocETL is a system for building and executing LLM-powered data processing pipelines, particularly for complex document tasks. It targets developers and researchers who need an interactive environment for prompt engineering plus a production-ready Python package for pipeline execution, so that pipelines can be developed iteratively and then run as automated data transformations.

How It Works

DocETL uses a pipeline-based architecture in which each step is an operator that can be configured with LLM prompts. Operators can be chained to build complex workflows for tasks such as data extraction, summarization, and transformation. The system emphasizes iterative development through an interactive UI (DocWrangler) that supports real-time prompt testing and pipeline visualization before a pipeline is exported for production use.
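
The operator-chaining idea can be illustrated with a minimal sketch in plain Python. This is not DocETL's actual API: the `MapOperator` class, the `run_pipeline` helper, and the `fake_llm` stub are all hypothetical names introduced here, and the stub simply upper-cases the prompt so the example runs without any LLM provider.

```python
from dataclasses import dataclass
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real pipeline would call a provider."""
    return prompt.upper()


@dataclass
class MapOperator:
    """Hypothetical operator: fills a prompt template per document and stores the reply."""
    name: str
    prompt_template: str
    output_key: str
    llm: Callable[[str], str] = fake_llm

    def run(self, docs: list[dict]) -> list[dict]:
        results = []
        for doc in docs:
            # Fill the template from the document's fields, then call the (stub) LLM.
            prompt = self.prompt_template.format(**doc)
            results.append({**doc, self.output_key: self.llm(prompt)})
        return results


def run_pipeline(operators: list[MapOperator], docs: list[dict]) -> list[dict]:
    """Chain operators: each one consumes the output of the previous one."""
    for op in operators:
        docs = op.run(docs)
    return docs


pipeline = [
    MapOperator("extract", "Extract: {text}", "extracted"),
    MapOperator("summarize", "Summarize: {extracted}", "summary"),
]
result = run_pipeline(pipeline, [{"text": "hello world"}])
```

Because each operator only reads and adds keys on the document dictionaries, a prompt can be changed and a single step re-run in isolation, which is the kind of iteration an interactive UI makes convenient.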

Quick Start & Requirements

  • Interactive UI (DocWrangler):
    • Docker (recommended): make docker
    • Manual Setup: git clone, set .env and .env.local files, make install, make install-ui, make run-ui-dev. Access at http://localhost:3000/playground.
  • Python Package:
    • Install: pip install docetl
    • Prerequisites: Python 3.10+, OpenAI API key (or another LLM provider via LiteLLM).
  • AWS Bedrock Support: Requires AWS credentials configured via aws configure or environment variables.
  • Resources: Local setup requires Docker or manual installation of Python dependencies. Running pipelines requires LLM API access, incurring costs.
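
The Python package prerequisites listed above can be checked before running a pipeline. The following is a minimal sketch using only the standard library; the `check_prerequisites` function is a hypothetical helper, not part of DocETL, and it assumes the OpenAI key is supplied through the conventional `OPENAI_API_KEY` environment variable. Other providers use different credentials, which this sketch does not cover.

```python
import os
import sys


def check_prerequisites(env=None, version_info=None):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: `env` and `version_info` can be injected for testing and
    default to the current process environment and interpreter version.
    """
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info

    problems = []
    # The summary lists Python 3.10+ as a prerequisite.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # Assumption: the OpenAI key is provided via OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter and environment; an empty list means both checks passed.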

Highlighted Details

  • Interactive UI (DocWrangler) for iterative prompt engineering and pipeline development.
  • Production-ready Python package for command-line or programmatic pipeline execution.
  • Supports OpenAI and AWS Bedrock as LLM providers, with other providers available through LiteLLM.
  • Includes community projects and educational resources for learning and contribution.

Maintenance & Community

The project is hosted on GitHub with community contributions encouraged. Links to community discussions or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Verify the license in the repository before commercial use or integration into closed-source projects.

Limitations & Caveats

The system relies heavily on LLM APIs, which incur costs and introduce variability in output. For provider-specific configuration and model compatibility, the README defers to the LiteLLM documentation. The project appears to be under active development, so breaking changes may occur.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 23
  • Issues (30d): 4
  • Star History: 527 stars in the last 90 days
