LLM tool for document transformation using natural language instructions
Top 62.3% on sourcepulse
Doctran is a Python framework for transforming unstructured text into structured data using Large Language Models (LLMs). It targets developers and researchers needing to process complex text for tasks like data labeling or semantic information extraction, offering a modular, declarative wrapper around OpenAI's function calling feature to simplify LLM interactions.
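To make the function-calling idea concrete, here is a minimal sketch of how a declarative list of properties could be turned into an OpenAI-style function definition with a JSON Schema parameters object, and how the model's JSON arguments could then be decoded. The names Property, build_function_schema, and parse_arguments are hypothetical and are not Doctran's API; the model reply is simulated so the example runs offline.

```python
import json
from dataclasses import dataclass


@dataclass
class Property:
    """One field to extract (hypothetical helper, not a Doctran class)."""
    name: str
    description: str
    type: str = "string"
    required: bool = True


def build_function_schema(name, properties):
    """Build an OpenAI-style function definition from declarative properties."""
    return {
        "name": name,
        "description": "Extract structured fields from a document.",
        "parameters": {
            "type": "object",
            "properties": {
                p.name: {"type": p.type, "description": p.description}
                for p in properties
            },
            "required": [p.name for p in properties if p.required],
        },
    }


def parse_arguments(arguments_json, schema):
    """Decode the model's JSON arguments and check required keys are present."""
    data = json.loads(arguments_json)
    missing = [k for k in schema["parameters"]["required"] if k not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data


props = [
    Property("contact_email", "Email address of the sender"),
    Property("amount", "Invoice total", type="number", required=False),
]
schema = build_function_schema("extract_information", props)

# Simulated model output; a real call would receive this string from the API.
simulated_reply = '{"contact_email": "jane@example.com", "amount": 120.5}'
result = parse_arguments(simulated_reply, schema)
print(result)
```

Keeping the property list declarative means the schema sent to the model and the validation applied to its reply are derived from the same definition.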
How It Works
Doctran acts as an LLM-powered processing pipeline, taking messy strings as input and producing clean, structured output. It leverages OpenAI's function calling capabilities to extract data based on provided JSON schemas and offers built-in transformers for common tasks like redaction (using spaCy locally), summarization, refinement, translation, and interrogation (converting text to Q&A pairs). The framework supports chaining these transformations in a specified order, allowing for complex multi-step processing workflows.
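The chaining behaviour described above can be sketched as a small deferred pipeline: each method queues a step and returns the pipeline itself, and nothing runs until execute is called. This is an illustration of the pattern only, under the assumption that steps are plain string-to-string functions. The Pipeline class and its methods are hypothetical stand-ins, with a regular expression in place of spaCy redaction and a first-sentence cut in place of LLM summarization.

```python
import re
from typing import Callable, List


class Pipeline:
    """Queues text transformations and runs them in the order they were added."""

    def __init__(self, content: str):
        self.content = content
        self._steps: List[Callable[[str], str]] = []

    def redact_emails(self) -> "Pipeline":
        # Stand-in for entity redaction: mask anything that looks like an email.
        pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
        self._steps.append(lambda text: pattern.sub("<EMAIL_ADDRESS>", text))
        return self

    def first_sentence(self) -> "Pipeline":
        # Stand-in for summarization: keep only the first sentence.
        self._steps.append(lambda text: text.split(". ")[0].rstrip(".") + ".")
        return self

    def execute(self) -> str:
        # Apply the queued steps one after another, in insertion order.
        text = self.content
        for step in self._steps:
            text = step(text)
        return text


raw = "Contact jane@example.com for the report. It is due on Friday."
result = Pipeline(raw).redact_emails().first_sentence().execute()
print(result)
```

Running redaction before any LLM-backed step is a natural ordering here, since the redaction happens locally and sensitive values are masked before text would leave the machine.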
Quick Start & Requirements
pip install doctran
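import os
from doctran import Doctran
# Read the API key from the environment instead of hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")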
examples.ipynb
Highlighted Details
Transformations: redact, extract, summarize, refine, translate, and interrogate.
DocumentTransformer or OpenAIDocumentTransformer.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The summarize transformer notes that token_limit may not be strictly respected by OpenAI.
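Because the limit is advisory, a caller may want to check the returned summary before using it. The sketch below is one possible client-side guard, not something Doctran provides. It assumes whitespace-separated words as a rough proxy for tokens (a real tokenizer would give different counts), and the function names are hypothetical.

```python
def approx_token_count(text: str) -> int:
    """Rough proxy: count whitespace-separated words, not real model tokens."""
    return len(text.split())


def enforce_limit(summary: str, token_limit: int) -> str:
    """Return the summary unchanged if within the limit, else truncate it."""
    words = summary.split()
    if len(words) <= token_limit:
        return summary
    return " ".join(words[:token_limit])


summary = "The quarterly report shows revenue grew while costs stayed flat overall"
trimmed = enforce_limit(summary, token_limit=5)
print(trimmed)
```

Truncation is the bluntest option; re-requesting a shorter summary is an alternative when a clipped sentence is unacceptable.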