olmocr by allenai

Toolkit for linearizing PDFs for LLM datasets/training

Created 1 year ago
14,094 stars

Top 3.5% on SourcePulse

Project Summary

This toolkit addresses the challenge of extracting and linearizing text from diverse PDF documents for large language model (LLM) training datasets. It offers a comprehensive pipeline for processing millions of PDFs, targeting researchers and engineers working with document understanding and LLM fine-tuning.

How It Works

olmOCR includes a GPT-4o prompting strategy for high-quality text parsing from PDF page images. For large-scale processing, it runs PDFs through a fine-tuned model (fine-tuning code is provided for Qwen2-VL and Molmo-O), using SGLang for efficient batch inference. The output is structured as Dolma-style JSONL, which makes it easy to integrate into LLM training workflows.
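Since the output is JSONL, downstream code can consume it one record per line. The following is a minimal sketch of that, not olmOCR's own reader: the field names (`id`, `text`, `source`, `metadata`) follow the general Dolma document convention and are an assumption here, and the sample record is hypothetical rather than real pipeline output.

```python
import io
import json
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample record. The field names are assumed from the general
# Dolma document convention, not copied from actual olmOCR output.
sample_record = {
    "id": "doc-0001",
    "text": "Page one text.\nPage two text.",
    "source": "olmocr",
    "metadata": {"Source-File": "example.pdf"},
}
sample_file = io.StringIO(json.dumps(sample_record) + "\n")

docs = list(iter_documents(sample_file))
total_chars = sum(len(doc["text"]) for doc in docs)
```

For a real results file, an open file handle can be passed in place of `sample_file`, since iterating over a file yields its lines.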

Quick Start & Requirements

  • Installation: pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
  • Prerequisites: NVIDIA GPU (>=20GB VRAM), 30GB disk space, poppler-utils, and Microsoft Core Fonts.
  • Setup: Requires setting up a conda environment and installing dependencies.
  • Demo: Online demo available at https://olmocr.allenai.org/.
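The prerequisites above can be checked before installing. This is a rough sketch and not part of olmOCR: the function names are hypothetical, the thresholds are approximations of the 20GB VRAM and 30GB disk figures, and `pdftoppm` is used as a stand-in for detecting poppler-utils. The Microsoft Core Fonts check is omitted because it is platform-specific.

```python
import shutil
import subprocess
from typing import List

# Assumed thresholds approximating the listed prerequisites.
MIN_VRAM_MIB = 20_000          # roughly 20GB of GPU memory
MIN_DISK_BYTES = 30 * 10**9    # roughly 30GB of free disk space


def has_poppler() -> bool:
    """poppler-utils ships the pdftoppm binary, so look for it on PATH."""
    return shutil.which("pdftoppm") is not None


def free_disk_ok(path: str = ".") -> bool:
    """True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= MIN_DISK_BYTES


def parse_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(tok) for tok in nvidia_smi_output.split() if tok.isdigit()]


def gpu_vram_mib() -> List[int]:
    """Total memory per GPU in MiB, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_vram_mib(result.stdout)


def vram_ok(vram_per_gpu: List[int]) -> bool:
    """True if at least one GPU meets the assumed VRAM threshold."""
    return any(mib >= MIN_VRAM_MIB for mib in vram_per_gpu)
```

Calling `has_poppler()`, `free_disk_ok()`, and `vram_ok(gpu_vram_mib())` gives three booleans that can be reported before running the install command.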

Highlighted Details

  • Utilizes a GPT-4o prompting strategy for robust PDF text extraction.
  • Supports batch processing of millions of PDFs via S3 integration and cluster execution (e.g., Beaker).
  • Includes fine-tuning code for Qwen2-VL and Molmo-O models.
  • Offers a side-by-side evaluation toolkit for comparing pipeline versions.

Maintenance & Community

Developed and maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2).

Licensing & Compatibility

Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Inference requires a recent NVIDIA GPU with at least 20GB of VRAM. The accompanying arXiv paper is dated 2025, so the project is relatively new and likely still under active development.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 12
  • Issues (30d): 12
  • Star history: 269 stars in the last 30 days
