BabelDOC by funstory-ai

PDF translation and bilingual comparison library

Created 1 year ago

7,785 stars

Top 6.6% on SourcePulse

Project Summary

BabelDOC is a Python library and CLI tool for translating PDF scientific papers, offering bilingual comparison and self-deployment options. It targets researchers and developers needing to process and translate academic documents, providing a pipeline for PDF parsing and rendering with support for various translation services.

How It Works

BabelDOC processes a PDF in two stages: it parses the document's structure (text blocks, images, tables), then renders that structure into a new PDF, optionally with translated text. It aims to preserve the original layout, unlike tools that convert to formats such as XML and can lose layout information along the way. The pipeline is designed to be plugin-based so that new models, OCR, and rendering engines can be integrated.
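The parse, translate, render flow described above can be pictured as a chain of interchangeable stages operating on a layout-preserving intermediate structure. The following is a minimal sketch of that idea only; it does not use BabelDOC's actual API, and every name in it (Block, Page, make_translate_stage, run_pipeline, render_bilingual) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# All types and functions here are hypothetical illustrations,
# not part of BabelDOC's real API.


@dataclass
class Block:
    """One layout element; the bounding box is kept so layout survives."""
    kind: str                                # "text", "image", or "table"
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 on the page
    text: str = ""
    translation: Optional[str] = None


@dataclass
class Page:
    number: int
    blocks: List[Block] = field(default_factory=list)


# A stage is any callable that takes pages and returns pages,
# which is what makes the pipeline pluggable.
Stage = Callable[[List[Page]], List[Page]]


def make_translate_stage(translate: Callable[[str], str]) -> Stage:
    """Wrap any text-to-text translator as a pipeline stage."""
    def stage(pages: List[Page]) -> List[Page]:
        for page in pages:
            for block in page.blocks:
                # Only text blocks are translated; images and tables pass through.
                if block.kind == "text" and block.text:
                    block.translation = translate(block.text)
        return pages
    return stage


def run_pipeline(pages: List[Page], stages: List[Stage]) -> List[Page]:
    """Apply each stage in order to the parsed pages."""
    for stage in stages:
        pages = stage(pages)
    return pages


def render_bilingual(pages: List[Page]) -> List[str]:
    """Stand-in renderer: one line per block, original next to translation."""
    lines = []
    for page in pages:
        for block in page.blocks:
            if block.kind == "text":
                lines.append(f"p{page.number} {block.bbox}: {block.text} | {block.translation}")
            else:
                lines.append(f"p{page.number} {block.bbox}: [{block.kind}]")
    return lines
```

Because each stage shares the same signature, swapping in a different translator or adding an OCR step is a matter of changing the list passed to run_pipeline, for example run_pipeline(pages, [make_translate_stage(str.upper)]) with a toy translator.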

Quick Start & Requirements

Install: uv tool install --python 3.12 BabelDOC (this provides the babeldoc command)
Prerequisites: Python 3.12, uv (for installation).
Usage (Bing): babeldoc --files example.pdf --bing
Usage (OpenAI-compatible API): uv run babeldoc --files example.pdf --openai --openai-api-key "your-api-key-here"
Docs: https://github.com/funstory-ai/BabelDOC
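For scripted use, the commands above can be assembled and launched from Python. The sketch below is a thin wrapper around the CLI, not BabelDOC's Python API; it uses only the flags shown in this section (--files, --bing, --openai, --openai-api-key), and the helper names build_babeldoc_command and translate_pdf are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_babeldoc_command(
    pdf_path: str,
    service: str = "bing",
    api_key: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the babeldoc CLI (hypothetical helper)."""
    cmd = ["babeldoc", "--files", pdf_path]
    if service == "bing":
        cmd.append("--bing")
    elif service == "openai":
        if not api_key:
            raise ValueError("an API key is required for the OpenAI service")
        cmd += ["--openai", "--openai-api-key", api_key]
    else:
        raise ValueError(f"unsupported service: {service!r}")
    return cmd


def translate_pdf(
    pdf_path: str,
    service: str = "bing",
    api_key: Optional[str] = None,
) -> subprocess.CompletedProcess:
    """Run babeldoc on one file; assumes babeldoc is installed and on PATH."""
    cmd = build_babeldoc_command(pdf_path, service, api_key)
    if shutil.which("babeldoc") is None:
        raise FileNotFoundError("babeldoc was not found on PATH")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names and API keys. In practice the key would more likely come from an environment variable or a secrets store than from source code.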

Highlighted Details

Supports translation via OpenAI-compatible APIs (e.g., GPT-4o-mini) and Bing.
Offers options for bilingual PDF output, page splitting for large documents, and compatibility enhancements.
Includes functionality to generate and restore offline asset packages for air-gapped environments.
Provides a Python API, though the project notes that it may be unstable until pdf2zh 2.0 is released.

Maintenance & Community

The project is sponsored by Immersive Translation. Contributions are encouraged via a CONTRIBUTING guide and a Code of Conduct.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Known issues include parsing errors in author and reference sections, no support for lines or drop caps, and skipping of large pages. The project explicitly states that its Python API does not guarantee compatibility before the release of pdf2zh 2.0. The primary focus is English-to-Chinese translation, and other language pairs have seen limited testing.

Explore Similar Projects

attranslate by fkirc

Index_PDF_Translation by Mega-Gorilla

ChatGPT-for-Translation by Raychanan

docutranslate by xunbu

zotero-pdf2zh by guaguastandup

ebook-GPT-translator by jesselau76

zotero-pdf-translate by windingwind

Easydict by tisfeng

TranslationPlugin by YiiGuxing

bilingual_book_maker by yihong0618

LibreTranslate by LibreTranslate

PDFMathTranslate by PDFMathTranslate