doctran by finic-ai

LLM tool for document transformation using natural language instructions

Created 2 years ago

507 stars

Top 61.5% on SourcePulse

3 Experts Love This Project

Gabriel Almeida

Cofounder of Langflow

Rodrigo Nader

Cofounder of Langflow

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Doctran is a Python framework for transforming unstructured text into structured data using Large Language Models (LLMs). It targets developers and researchers who need to process complex text for tasks such as data labeling or semantic information extraction. To simplify LLM interactions, it provides a modular, declarative wrapper around OpenAI's function calling feature.

How It Works

Doctran acts as an LLM-powered processing pipeline that takes messy strings as input and produces clean, structured output. It uses OpenAI's function calling to extract data according to caller-provided JSON schemas, and it ships built-in transformers for common tasks: redaction (run locally with spaCy), summarization, refinement, translation, and interrogation (converting text into Q&A pairs). Transformations can be chained in a specified order to build multi-step processing workflows.
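To make the extraction step more concrete, here is a minimal sketch of how a list of desired properties could be turned into a function definition in the JSON Schema shape that OpenAI's function calling accepts. The helper build_extraction_function, the function name extract_information, and the sample properties are hypothetical and are not taken from Doctran's source.

```python
import json


def build_extraction_function(properties):
    """Build an OpenAI-style function definition from simple property specs.

    Each spec is a dict with "name", "type", "description", and an optional
    "required" flag. All names here are illustrative assumptions.
    """
    schema_props = {}
    required = []
    for spec in properties:
        schema_props[spec["name"]] = {
            "type": spec["type"],
            "description": spec["description"],
        }
        if spec.get("required", False):
            required.append(spec["name"])
    return {
        "name": "extract_information",  # hypothetical function name
        "description": "Extract structured fields from a document.",
        "parameters": {
            "type": "object",
            "properties": schema_props,
            "required": required,
        },
    }


# Hypothetical properties a caller might want pulled out of an email.
function_def = build_extraction_function([
    {"name": "sender", "type": "string",
     "description": "Who wrote the message.", "required": True},
    {"name": "topic", "type": "string",
     "description": "Main subject of the message."},
])
print(json.dumps(function_def, indent=2))
```

A definition of this shape is what gets sent alongside the prompt; the model then replies with JSON arguments matching the schema, which is how messy text ends up as structured fields.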

Quick Start & Requirements

Install via pip: pip install doctran
Requires an OpenAI API key.

Example usage:

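import os
from doctran import Doctran
# Read the API key from the environment instead of hard-coding it
doctran = Doctran(openai_api_key=os.environ["OPENAI_API_KEY"])
# Wrap the raw string in a document object that transformations can be applied to
document = doctran.parse(content="your_content_as_string")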

Official examples are available in examples.ipynb.

Highlighted Details

Converts unstructured text into semi-structured JSON optimized for vector search.
Supports chaining transformations like redact, extract, summarize, refine, translate, and interrogate.
Includes a local PII redaction transformer using spaCy, avoiding external API calls for sensitive data.
Facilitates custom transformer development by extending DocumentTransformer or OpenAIDocumentTransformer.
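The chaining and custom-transformer ideas listed above can be sketched in plain Python. This is an illustration of the pattern only: Transformer, Pipeline, then, RedactEmails, and CollapseWhitespace are hypothetical stand-ins, not Doctran's DocumentTransformer API, and the regex-based email redaction is a simplified substitute for the spaCy-based redaction the project describes.

```python
import re
from abc import ABC, abstractmethod


class Transformer(ABC):
    """Hypothetical stand-in for a transformer base class (not Doctran's own)."""

    @abstractmethod
    def transform(self, text: str) -> str:
        ...


class RedactEmails(Transformer):
    """Local, rule-based redaction: no external API call is made."""

    def transform(self, text: str) -> str:
        return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>", text)


class CollapseWhitespace(Transformer):
    """Custom step that normalizes runs of whitespace to single spaces."""

    def transform(self, text: str) -> str:
        return " ".join(text.split())


class Pipeline:
    """Queues transformers and applies them in the order they were added."""

    def __init__(self, text: str):
        self.text = text
        self._steps = []

    def then(self, transformer: Transformer) -> "Pipeline":
        self._steps.append(transformer)
        return self  # returning self is what enables method chaining

    def execute(self) -> str:
        result = self.text
        for step in self._steps:
            result = step.transform(result)
        return result


result = (
    Pipeline("Contact   jane@example.com\nfor details.")
    .then(RedactEmails())
    .then(CollapseWhitespace())
    .execute()
)
print(result)
```

Running the redaction step first means the sensitive value is already masked before any later step sees the text, which is the main reason order matters in a chain like this.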

Maintenance & Community

Lightly maintained by jasonwcfan.
Contributions are welcomed, particularly for transformers that do not rely on external APIs.

Licensing & Compatibility

The README does not explicitly state a license.

Limitations & Caveats

Relies heavily on OpenAI's API for many core functionalities, incurring associated costs and latency.
The documentation for the summarize transformer notes that OpenAI may not strictly respect token_limit.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days