doctran by finic-ai

LLM tool for document transformation using natural language instructions

created 2 years ago
507 stars

Top 62.3% on sourcepulse

1 Expert Loves This Project
Project Summary

Doctran is a Python framework for transforming unstructured text into structured data using Large Language Models (LLMs). It targets developers and researchers needing to process complex text for tasks like data labeling or semantic information extraction, offering a modular, declarative wrapper around OpenAI's function calling feature to simplify LLM interactions.

How It Works

Doctran acts as an LLM-powered processing pipeline, taking messy strings as input and producing clean, structured output. It leverages OpenAI's function calling capabilities to extract data based on provided JSON schemas and offers built-in transformers for common tasks like redaction (using spaCy locally), summarization, refinement, translation, and interrogation (converting text to Q&A pairs). The framework supports chaining these transformations in a specified order, allowing for complex multi-step processing workflows.
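To make the schema-driven extraction step more concrete, here is a minimal sketch of what an OpenAI-style function definition built from a JSON schema can look like, and how the returned arguments might be decoded. It makes no network call and does not use Doctran itself: `build_function_schema`, `parse_arguments`, the `extract_contact` function name, and the simulated response string are all hypothetical, chosen only for illustration.

```python
import json


def build_function_schema(name, description, properties, required):
    """Wrap a JSON Schema fragment in an OpenAI-style function definition."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def parse_arguments(arguments_json, schema):
    """Decode the JSON arguments string and check that required keys exist."""
    data = json.loads(arguments_json)
    missing = [key for key in schema["parameters"]["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    return data


# Hypothetical schema describing the fields to pull out of a message.
schema = build_function_schema(
    name="extract_contact",
    description="Extract contact details from a message.",
    properties={
        "name": {"type": "string", "description": "Full name of the sender"},
        "email": {"type": "string", "description": "Email address of the sender"},
    },
    required=["name", "email"],
)

# Stand-in for the arguments string a model would return for this function.
simulated_arguments = '{"name": "Ada Lovelace", "email": "ada@example.com"}'
extracted = parse_arguments(simulated_arguments, schema)
print(extracted["email"])
```

The point of the sketch is the shape of the data: the schema tells the model which fields to fill, and the caller validates the structured reply before using it.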

Quick Start & Requirements

  • Install via pip: pip install doctran
  • Requires an OpenAI API key.
  • Example usage:
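    import os
    from doctran import Doctran
    # Assumes the key is exported as the OPENAI_API_KEY environment variable.
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
    doctran = Doctran(openai_api_key=OPENAI_API_KEY)
    document = doctran.parse(content="your_content_as_string")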
    
  • Official examples available at examples.ipynb.

Highlighted Details

  • Converts unstructured text into semi-structured JSON optimized for vector search.
  • Supports chaining transformations like redact, extract, summarize, refine, translate, and interrogate.
  • Includes a local PII redaction transformer using spaCy, avoiding external API calls for sensitive data.
  • Facilitates custom transformer development by extending DocumentTransformer or OpenAIDocumentTransformer.
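The chaining and custom-transformer ideas in the list above can be sketched in plain Python. This is an illustration of the pattern only, not Doctran's actual classes or signatures: `SketchTransformer`, `SketchDocument`, `then`, `EmailRedactor`, and `WhitespaceRefiner` are hypothetical names, and the regex-based redactor is a crude stand-in for spaCy-based PII detection.

```python
import re
from typing import List


class SketchTransformer:
    """Hypothetical base class: one text-in, text-out processing step."""

    def transform(self, text: str) -> str:
        raise NotImplementedError


class EmailRedactor(SketchTransformer):
    """Runs locally with a regex, so no text leaves the machine."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def transform(self, text: str) -> str:
        return self.EMAIL.sub("<EMAIL>", text)


class WhitespaceRefiner(SketchTransformer):
    """Collapses runs of whitespace into single spaces."""

    def transform(self, text: str) -> str:
        return " ".join(text.split())


class SketchDocument:
    """Queues transformers and applies them in order on execute()."""

    def __init__(self, content: str):
        self.content = content
        self._steps: List[SketchTransformer] = []

    def then(self, step: SketchTransformer) -> "SketchDocument":
        self._steps.append(step)
        return self  # returning self is what allows chaining

    def execute(self) -> "SketchDocument":
        text = self.content
        for step in self._steps:
            text = step.transform(text)
        return SketchDocument(text)  # the original document is left untouched


raw = "Contact   me at\n jane.doe@example.com   for details."
document = SketchDocument(raw)
result = document.then(EmailRedactor()).then(WhitespaceRefiner()).execute()
print(result.content)
```

Placing the local redaction step first means any later step that calls an external API would only ever see the redacted text, which is the usual reason for ordering a chain this way.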

Maintenance & Community

  • Lightly maintained by jasonwcfan.
  • Contributions are welcomed, particularly for transformers that do not rely on external APIs.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Relies heavily on OpenAI's API for many core functionalities, incurring associated costs and latency.
  • The documentation for the summarize transformer notes that OpenAI may not strictly respect token_limit.
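Because a requested length limit may be exceeded, a caller might want to check the returned summary before storing it. The following is a minimal sketch under a loud assumption: it counts whitespace-separated words as a rough proxy for tokens, which does not match how OpenAI models tokenize text. `enforce_word_budget` is a hypothetical helper, not part of Doctran.

```python
def enforce_word_budget(summary: str, max_words: int) -> str:
    """Truncate a summary to at most max_words whitespace-separated words.

    Words are only a crude approximation of model tokens.
    """
    words = summary.split()
    if len(words) <= max_words:
        return summary  # already within budget, return unchanged
    return " ".join(words[:max_words]) + " ..."


too_long = "This summary came back longer than the limit that was requested"
print(enforce_word_budget(too_long, 5))
```

For stricter control, the word count could be replaced by a real tokenizer for the model in use.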

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days

Explore Similar Projects

Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Travis Fischer (Founder of Agentic), and 1 more.

instructor-js by 567-labs

0%
738
TypeScript tool for structured extraction from LLMs
created 1 year ago
updated 6 months ago
Starred by Peter Norvig (Author of Artificial Intelligence: A Modern Approach; Research Director at Google).

python-openai-demos by pamelafox

0%
374
Python scripts for OpenAI API demos
created 1 year ago
updated 1 week ago