Index_PDF_Translation by Mega-Gorilla

Local PDF translator preserving document layout

Created 1 year ago

301 stars

Top 88.8% on SourcePulse

Project Summary

Index_PDF_Translation is a local command-line tool (formerly a web service) that translates academic PDFs while preserving the original layout. It identifies and classifies text blocks before translating them, which helps keep complex documents accurate after translation. It is aimed at researchers and academics who need to read foreign-language papers.

How It Works

The tool uses PyMuPDF to extract text and its coordinates from the PDF. It then uses spaCy to classify each text block as main body text, a figure/table caption, or an element to ignore. Based on this classification, a cross-block translation step merges sentences that were split across block and page boundaries, so each sentence is translated with its full context. The merged text is translated by a pluggable backend (Google, DeepL, or OpenAI) and re-inserted into a new PDF. The tool can optionally generate a side-by-side comparison of the original and the translation.
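The cross-block merging step can be illustrated with a minimal sketch. This is not the project's implementation: the `Block` class, the `kind` labels, and the sentence-ending rule are all assumptions made for illustration, and the hyphen handling is deliberately naive.

```python
from dataclasses import dataclass

# Characters treated as sentence terminators in this sketch (an assumption).
SENTENCE_END = (".", "!", "?", "。", "！", "？")


@dataclass
class Block:
    """Hypothetical stand-in for one extracted text block."""
    page: int  # zero-based page index
    text: str  # raw text of the block
    kind: str  # "body", "caption", or "ignore" (hypothetical labels)


def merge_body_blocks(blocks: list[Block]) -> list[str]:
    """Join consecutive body blocks until the text ends a sentence."""
    merged: list[str] = []
    pending = ""
    for block in blocks:
        if block.kind != "body":
            continue  # captions and ignored blocks do not interrupt a sentence
        text = " ".join(block.text.split())  # collapse line breaks and spaces
        if not text:
            continue
        if pending.endswith("-"):
            # Naively rejoin a word hyphenated across the block boundary.
            pending = pending[:-1] + text
        elif pending:
            pending = pending + " " + text
        else:
            pending = text
        if pending.endswith(SENTENCE_END):
            merged.append(pending)
            pending = ""
    if pending:
        merged.append(pending)  # keep a trailing unfinished fragment
    return merged
```

Because the function only looks at the order of blocks, a sentence that starts at the bottom of one page and ends at the top of the next is merged the same way as one split within a page.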

Quick Start & Requirements

Requires Python 3.11+. Install dependencies with uv sync or pip install -r requirements.txt, then download the spaCy language models (en_core_web_sm, ja_core_news_sm). Google Translate is the default backend and requires no API key. The DeepL and OpenAI backends require API keys and may need the optional extras index-pdf-translation[deepl] and index-pdf-translation[openai].
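The requirements above can be checked before a first run with a small script. The following is a sketch, not part of the project: the function name `missing_requirements` is hypothetical, and it relies on the fact that spaCy models install as ordinary importable Python packages.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 11)
SPACY_MODELS = ("en_core_web_sm", "ja_core_news_sm")


def missing_requirements(models=SPACY_MODELS, version=None) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for name in models:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"spaCy model not installed: {name}")
    return problems
```

Calling `missing_requirements()` with no arguments checks the running interpreter and both models. A missing model can usually be installed with python -m spacy download followed by the model name.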

Highlighted Details

Automatic identification and translation of figure/table captions, separate from main body text.
Cross-block and cross-page translation to resolve sentence fragmentation issues.
Generation of a side-by-side PDF output comparing the original and translated documents.
An optional debug mode visualizes text block classification and distribution histograms for analysis.
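To make the idea of block classification and its histogram concrete, here is a simplified sketch. It swaps the project's spaCy-based analysis for a plain regular-expression heuristic, so the rule, the labels, and the `min_body_chars` threshold are assumptions for illustration only.

```python
import re
from collections import Counter

# Matches block openings such as "Figure 3", "Fig. 1", "Table 2", "図1", "表2".
CAPTION_RE = re.compile(r"^(fig(ure)?\.?|table|図|表)\s*\d+", re.IGNORECASE)


def classify_block(text: str, min_body_chars: int = 40) -> str:
    """Label a block as "caption", "body", or "ignore" (hypothetical labels)."""
    stripped = " ".join(text.split())
    if CAPTION_RE.match(stripped):
        return "caption"
    if len(stripped) < min_body_chars:
        return "ignore"  # short fragments such as page numbers or headers
    return "body"


def label_histogram(texts: list[str]) -> Counter:
    """Count how many blocks received each label."""
    return Counter(classify_block(t) for t in texts)
```

A histogram like this gives a quick view of how a document was split, which is the kind of overview the debug mode is described as providing. Note that a body sentence beginning with "Fig. 1 shows" would be mislabeled by this heuristic.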

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or project roadmaps.

Licensing & Compatibility

Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). This strong copyleft license requires that distributed modifications and derivative works also be released under the AGPL-3.0, and it extends the same source-availability requirement to modified versions offered to users over a network. Use in closed-source projects may therefore be restricted.

Limitations & Caveats

The tool cannot process scanned PDFs that lack an OCR layer, as text extraction will fail. Complex PDF layouts may lead to inaccurate block classification or text insertion issues. While the debug mode aids analysis, resolving intricate layout problems might require manual intervention.
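A scanned PDF without an OCR layer can often be detected up front by checking whether any text comes out of each page. The sketch below assumes the per-page text has already been extracted; the function name and the `min_chars` threshold are hypothetical.

```python
def pages_without_text(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indices of pages with almost no extractable text."""
    empty_pages = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            empty_pages.append(index)
    return empty_pages


# With PyMuPDF installed, page_texts could be built like this (assumed wiring):
#   import fitz
#   page_texts = [page.get_text() for page in fitz.open("paper.pdf")]
```

If every page is reported, the file is probably a scan and would need OCR before translation.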

Health Check

Last Commit: 2 months ago

Responsiveness: Inactive

Star History

1 star in the last 30 days