Toolkit for linearizing PDFs for LLM datasets/training
This toolkit addresses the challenge of extracting and linearizing text from diverse PDF documents for large language model (LLM) training datasets. It offers a comprehensive pipeline for processing millions of PDFs, targeting researchers and engineers working with document understanding and LLM fine-tuning.
How It Works
olmOCR's core is a prompting strategy that uses GPT-4o to produce high-quality text parses from PDF page images. PDFs are then processed with fine-tuned Qwen2-VL and Molmo-O models, with SGLang providing efficient batch inference. The output is structured as Dolma-style JSONL, which integrates easily into LLM training workflows.
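As a rough illustration, the JSONL output can be consumed like any other Dolma-style file. The field names below (id, text, metadata) follow the general Dolma convention and are assumptions here, not the documented olmOCR schema; check the project's docs for the exact layout.

import json

# Minimal sketch: read olmOCR output and collect the extracted text per record.
# Field names (id, text, metadata) are assumed from the Dolma convention.
with open("output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc_id = record.get("id")          # document identifier
        text = record.get("text", "")      # linearized plain text of the PDF
        meta = record.get("metadata", {})  # e.g. source file, page info
        print(doc_id, len(text), meta)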
Quick Start & Requirements
pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
System dependencies include poppler-utils and Microsoft Core Fonts.
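A minimal sketch of kicking off a local conversion run is shown below; the module path (olmocr.pipeline), the workspace argument, and the --pdfs flag are assumptions about the CLI and should be checked against the current documentation.

import subprocess
import sys

# Sketch of invoking the conversion pipeline on a local PDF.
# The module path and flags are assumptions, not the confirmed CLI.
cmd = [
    sys.executable, "-m", "olmocr.pipeline",
    "./localworkspace",        # working directory for intermediate results
    "--pdfs", "paper.pdf",     # one or more input PDFs
]
subprocess.run(cmd, check=True)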
Highlighted Details
Maintenance & Community
Developed and maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2).
Licensing & Compatibility
Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Requires a recent NVIDIA GPU with substantial VRAM for inference. The accompanying arXiv paper is dated 2025, so the project is relatively new and likely still under active development.