BabelDOC by funstory-ai

PDF translation and bilingual comparison library

created 8 months ago
4,687 stars

Top 10.7% on sourcepulse

Project Summary

BabelDOC is a Python library and CLI tool for translating PDF scientific papers, offering bilingual comparison and self-deployment options. It targets researchers and developers needing to process and translate academic documents, providing a pipeline for PDF parsing and rendering with support for various translation services.

How It Works

BabelDOC parses a PDF into its structural elements (text blocks, images, tables) and then renders that structure into a new PDF, optionally with the text translated. Because it works on the parsed layout directly, it aims to preserve the original document structure; tools that convert to an intermediate format such as XML can lose layout information along the way. The pipeline is designed to be plugin-based so that new models, OCR, and rendering engines can be integrated.
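The parse, translate, and render flow described above can be illustrated with a minimal sketch of a plugin-style pipeline. This is not BabelDOC's actual API: `Block`, `Document`, `register`, and `run_pipeline` are hypothetical names invented for illustration, and upper-casing stands in for a real translation service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page


@dataclass
class Block:
    """One layout element recovered from a page (hypothetical model)."""
    kind: str              # "text", "image", or "table"
    bbox: BBox
    content: str = ""
    translation: str = ""


@dataclass
class Document:
    blocks: List[Block] = field(default_factory=list)
    rendered: List[Tuple[BBox, str]] = field(default_factory=list)


Stage = Callable[[Document], Document]
_REGISTRY: Dict[str, Stage] = {}


def register(name: str) -> Callable[[Stage], Stage]:
    """Register a stage under a name so pipelines can be assembled by name."""
    def wrap(fn: Stage) -> Stage:
        _REGISTRY[name] = fn
        return fn
    return wrap


@register("translate")
def translate_stage(doc: Document) -> Document:
    # Stand-in for a translation service: only text blocks are touched.
    for block in doc.blocks:
        if block.kind == "text":
            block.translation = block.content.upper()
    return doc


@register("render")
def render_stage(doc: Document) -> Document:
    # Each block keeps its original bounding box, so the layout is preserved.
    doc.rendered = [(b.bbox, b.translation or b.content) for b in doc.blocks]
    return doc


def run_pipeline(doc: Document, stage_names: List[str]) -> Document:
    """Run the named stages in order."""
    for name in stage_names:
        doc = _REGISTRY[name](doc)
    return doc
```

In this sketch, adding a new stage (an OCR step, say) only requires registering another function; the bounding boxes travel with each block, which is the property that lets the output keep the source layout.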

Quick Start & Requirements

  • Install: uv tool install --python 3.12 BabelDOC
  • Prerequisites: Python 3.12, uv (for installation).
  • Usage (Bing): babeldoc --files example.pdf --bing
  • Usage (OpenAI-compatible API): uv run babeldoc --files example.pdf --openai --openai-api-key "your-api-key-here"
  • Docs: https://github.com/funstory-ai/BabelDOC
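For scripted use, the two command lines above can be assembled from Python and passed to the CLI. The sketch below uses only the flags shown in this section (`--files`, `--bing`, `--openai`, `--openai-api-key`); the helper names `build_babeldoc_command` and `run_translation` are hypothetical, and it assumes `babeldoc` is already installed and on the PATH.

```python
import shlex
import subprocess
from typing import List, Optional


def build_babeldoc_command(pdf_path: str,
                           openai_api_key: Optional[str] = None) -> List[str]:
    """Build the argument list for one babeldoc run.

    Without an API key the Bing backend is selected; with one, the
    OpenAI-compatible backend is used instead.
    """
    cmd = ["babeldoc", "--files", pdf_path]
    if openai_api_key is None:
        cmd.append("--bing")
    else:
        cmd += ["--openai", "--openai-api-key", openai_api_key]
    return cmd


def run_translation(pdf_path: str, openai_api_key: Optional[str] = None) -> None:
    """Invoke the CLI; raises CalledProcessError on a non-zero exit code."""
    cmd = build_babeldoc_command(pdf_path, openai_api_key)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.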

Highlighted Details

  • Supports translation via OpenAI-compatible APIs (e.g., GPT-4o-mini) and Bing.
  • Offers options for bilingual PDF output, page splitting for large documents, and compatibility enhancements.
  • Includes functionality to generate and restore offline asset packages for air-gapped environments.
  • Provides a Python API, though the project does not guarantee API compatibility before pdf2zh 2.0.
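As an illustration of the page-splitting idea in the list above, a large document can be divided into consecutive page ranges that are processed one part at a time. This is a generic sketch, not BabelDOC's implementation; `split_page_ranges` and its parameters are hypothetical.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, pages_per_part: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


# A 10-page document split into parts of at most 4 pages:
print(split_page_ranges(10, 4))  # [(1, 4), (5, 8), (9, 10)]
```

Each range can then be handled independently, which bounds the memory needed for any single part.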

Maintenance & Community

The project is sponsored by Immersive Translation. Contributions are encouraged via a CONTRIBUTING guide and a Code of Conduct.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Known issues include parsing errors in author and reference sections, no support for lines or drop caps, and large pages being skipped. The project explicitly does not guarantee Python API compatibility before pdf2zh 2.0. The primary focus is English-to-Chinese translation; other language pairs have had limited testing.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 9
  • Issues (30d): 7

Star History

1,503 stars in the last 90 days
