PDF translation and bilingual comparison library
BabelDOC is a Python library and CLI tool for translating PDF scientific papers, offering bilingual comparison and self-deployment options. It targets researchers and developers needing to process and translate academic documents, providing a pipeline for PDF parsing and rendering with support for various translation services.
How It Works
BabelDOC processes PDFs by parsing their structure (text blocks, images, tables) and then rendering this structure into a new PDF, potentially with translations. It aims to preserve original document structure, unlike some tools that convert to formats like XML, which can lose layout information. The pipeline is designed to be plugin-based, allowing for the integration of new models, OCR, and rendering engines.
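The parse, translate, render flow described above can be illustrated with a small sketch. This is not BabelDOC's actual API: `Block`, `Document`, `REGISTRY`, `register`, and `run_pipeline` are hypothetical names chosen for illustration, and the "translator" simply uppercases text. It only shows how a plugin-style pipeline can pass one intermediate structure through interchangeable stages while leaving block order and non-text blocks untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """One layout element recovered from a page (hypothetical model)."""
    kind: str               # "text", "image", or "table"
    text: str = ""          # empty for non-text blocks
    translation: str = ""   # filled in by a translation stage


@dataclass
class Document:
    """Intermediate structure shared by all stages (hypothetical)."""
    blocks: List[Block] = field(default_factory=list)


# A stage takes a Document and returns a Document.
Stage = Callable[[Document], Document]

# Hypothetical plugin registry: stage name -> implementation.
REGISTRY: Dict[str, Stage] = {}


def register(name: str) -> Callable[[Stage], Stage]:
    """Decorator that registers a stage under the given name."""
    def wrap(stage: Stage) -> Stage:
        REGISTRY[name] = stage
        return stage
    return wrap


@register("translate")
def toy_translate(doc: Document) -> Document:
    """Stand-in translator: uppercases text blocks, skips everything else."""
    for block in doc.blocks:
        if block.kind == "text":
            block.translation = block.text.upper()
    return doc


@register("render")
def render_bilingual(doc: Document) -> Document:
    """Stand-in renderer: places original and translation side by side."""
    for block in doc.blocks:
        if block.kind == "text":
            block.text = f"{block.text} | {block.translation}"
    return doc


def run_pipeline(doc: Document, stage_names: List[str]) -> Document:
    """Run the named stages in order; block order is never changed."""
    for name in stage_names:
        doc = REGISTRY[name](doc)
    return doc


doc = Document([Block("text", "hello"), Block("image"), Block("text", "world")])
result = run_pipeline(doc, ["translate", "render"])
print([b.text for b in result.blocks])
```

Swapping in a different translator or renderer in this sketch means registering another function under the same name, which is the general idea behind a plugin-based design.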
Quick Start & Requirements
Install (provides the babeldoc command): uv tool install --python 3.12 BabelDOC
Translate with Bing: babeldoc --files example.pdf --bing
Or translate with OpenAI: uv run babeldoc --files example.pdf --openai --openai-api-key "your-api-key-here"
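For scripted use, the CLI calls above can be wrapped from Python. The sketch below is a hedged illustration rather than part of BabelDOC: `build_babeldoc_command` and `translate_pdf` are hypothetical helper names, and only the flags shown in this section (`--files`, `--bing`, `--openai`, `--openai-api-key`) are used. It assumes the `babeldoc` command is already installed and on the PATH.

```python
import shlex
import subprocess
from typing import List, Optional


def build_babeldoc_command(pdf_path: str,
                           openai_api_key: Optional[str] = None) -> List[str]:
    """Build the argument list for one babeldoc invocation (hypothetical helper).

    Without an API key the Bing backend is selected; with a key, OpenAI.
    """
    cmd = ["babeldoc", "--files", pdf_path]
    if openai_api_key is None:
        cmd.append("--bing")
    else:
        cmd += ["--openai", "--openai-api-key", openai_api_key]
    return cmd


def translate_pdf(pdf_path: str,
                  openai_api_key: Optional[str] = None) -> subprocess.CompletedProcess:
    """Run babeldoc and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_babeldoc_command(pdf_path, openai_api_key),
                          check=True)


# Print the command instead of running it, so the sketch works without babeldoc.
print(shlex.join(build_babeldoc_command("example.pdf")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.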
Highlighted Details
pdf2zh 2.0
Maintenance & Community
The project is sponsored by Immersive Translation. Contributions are encouraged via a CONTRIBUTING guide and a Code of Conduct.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Known issues include parsing errors in author and reference sections, no support for lines, no support for drop caps, and skipping of large pages. The project's Python API is explicitly stated not to guarantee compatibility before pdf2zh 2.0. The primary focus is English-to-Chinese translation, with limited testing for other language pairs.