Toolkit for linearizing PDFs for LLM datasets/training
This toolkit addresses the challenge of extracting and linearizing text from diverse PDF documents for large language model (LLM) training datasets. It offers a comprehensive pipeline for processing millions of PDFs, targeting researchers and engineers working with document understanding and LLM fine-tuning.
How It Works
olmOCR's core is a prompting strategy that uses GPT-4o to produce high-quality text parses from PDF page images. PDFs are then processed with fine-tuned Qwen2-VL and Molmo-O models, with SGLang providing efficient batch inference. The output is structured as Dolma-style JSONL, which integrates easily into LLM training workflows.
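As a rough illustration, the JSONL output can be consumed like any other Dolma-style file. The field names below (id, text, metadata) follow the general Dolma convention and are assumptions here, not the documented olmOCR schema; check the project's docs for the exact layout.

import json

# Minimal sketch: read olmOCR output and collect the extracted text per record.
# Field names (id, text, metadata) are assumed from the Dolma convention.
with open("output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc_id = record.get("id")          # document identifier
        text = record.get("text", "")      # linearized plain text of the PDF
        meta = record.get("metadata", {})  # e.g. source file, page info
        print(doc_id, len(text), meta)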
Quick Start & Requirements
pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
System dependencies include poppler-utils and Microsoft Core Fonts.
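A minimal sketch of kicking off a local conversion run is shown below; the module path (olmocr.pipeline), the workspace argument, and the --pdfs flag are assumptions about the CLI and should be checked against the current documentation.

import subprocess
import sys

# Sketch of invoking the conversion pipeline on a local PDF.
# The module path and flags are assumptions, not the confirmed CLI.
cmd = [
    sys.executable, "-m", "olmocr.pipeline",
    "./localworkspace",        # working directory for intermediate results
    "--pdfs", "paper.pdf",     # one or more input PDFs
]
subprocess.run(cmd, check=True)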
Highlighted Details
Maintenance & Community
Developed and maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2).
Licensing & Compatibility
Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Requires a recent NVIDIA GPU with substantial VRAM for inference. The accompanying arXiv paper is dated 2025, so the project is relatively new and likely still under active development.