Agentic LLM-powered system for data processing and ETL
DocETL is a system for building and executing LLM-powered data processing pipelines, particularly for complex document tasks. It targets developers and researchers needing an interactive environment for prompt engineering and a production-ready Python package for pipeline execution, offering iterative development and automated data transformation.
How It Works
DocETL uses a pipeline-based architecture in which each step is an operator configured with an LLM prompt. Operators can be chained into complex workflows for tasks such as data extraction, summarization, and transformation. The system emphasizes iterative development through an interactive UI (DocWrangler), which supports real-time prompt testing and pipeline visualization before a pipeline is exported for production use.
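The operator-chaining idea can be illustrated with a minimal, library-independent sketch. This is not DocETL's API: the names MapOperator, run_pipeline, and fake_llm are hypothetical, and fake_llm is a stand-in that echoes its prompt instead of calling a real model. It only shows how prompt-configured steps might pass documents from one to the next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Doc = Dict[str, str]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; echoes the prompt it received."""
    return f"[LLM output for: {prompt}]"


@dataclass
class MapOperator:
    """Hypothetical operator: fills a prompt template per document and
    stores the model's answer under output_key."""
    name: str
    prompt_template: str
    output_key: str
    llm: Callable[[str], str]

    def run(self, docs: List[Doc]) -> List[Doc]:
        results = []
        for doc in docs:
            prompt = self.prompt_template.format(**doc)
            # Copy the document so earlier pipeline inputs are not mutated.
            results.append({**doc, self.output_key: self.llm(prompt)})
        return results


def run_pipeline(operators: List[MapOperator], docs: List[Doc]) -> List[Doc]:
    """Feed the output of each operator into the next one."""
    for op in operators:
        docs = op.run(docs)
    return docs


docs = [{"text": "Invoice 42 from Acme, due 2024-01-31."}]
pipeline = [
    MapOperator("extract", "Extract the vendor from: {text}", "vendor", fake_llm),
    MapOperator("summarize", "Summarize: {text} (vendor: {vendor})", "summary", fake_llm),
]
result = run_pipeline(pipeline, docs)
print(result[0]["summary"])
```

Because the second operator's template refers to {vendor}, it depends on the first operator's output. That dependency is what makes the order of steps in a chained pipeline significant.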
Quick Start & Requirements
Docker: make docker
From source: git clone, set up the .env and .env.local files, then run make install, make install-ui, and make run-ui-dev. Access the UI at http://localhost:3000/playground.
Python package: pip install docetl
AWS credentials: aws configure or environment variables.
Highlighted Details
Maintenance & Community
The project is hosted on GitHub with community contributions encouraged. Links to community discussions or roadmaps are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. It is crucial to verify the license for commercial use or integration into closed-source projects.
Limitations & Caveats
The system relies heavily on LLM APIs, which can incur costs and introduce variability in output. For LLM provider configuration and model compatibility, the README points to the LiteLLM documentation. The project appears to be actively developed, so breaking changes may occur.
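To get a feel for the cost caveat before running a pipeline, a rough estimate can help. The sketch below is a back-of-the-envelope calculation, not a DocETL feature: the function name estimate_cost_usd and all token counts and prices are hypothetical placeholders, and it assumes every LLM-backed step makes one call per document.

```python
def estimate_cost_usd(
    num_docs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    num_llm_steps: int = 1,
) -> float:
    """Rough cost estimate, assuming one LLM call per document per step."""
    per_call = (
        avg_input_tokens / 1000 * usd_per_1k_input
        + avg_output_tokens / 1000 * usd_per_1k_output
    )
    return per_call * num_docs * num_llm_steps


# Hypothetical numbers; replace with your provider's actual pricing.
cost = estimate_cost_usd(
    num_docs=1000,
    avg_input_tokens=2000,
    avg_output_tokens=200,
    usd_per_1k_input=0.001,
    usd_per_1k_output=0.002,
    num_llm_steps=2,
)
print(f"Estimated cost: ${cost:.2f}")
```

Steps that aggregate many documents into a single call would cost less than this per-document assumption suggests, so treat the result as a rough upper bound.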