olmocr by allenai

Toolkit for linearizing PDFs for LLM datasets/training

created 10 months ago
13,403 stars

Top 3.8% on sourcepulse

Project Summary

This toolkit addresses the challenge of extracting and linearizing text from diverse PDF documents for large language model (LLM) training datasets. It offers a comprehensive pipeline for processing millions of PDFs, targeting researchers and engineers working with document understanding and LLM fine-tuning.

How It Works

At its core, olmOCR uses a GPT-4o prompting strategy for high-quality text parsing from PDF page images. It processes PDFs through fine-tuned Qwen2-VL and Molmo-O models, using SGLang for efficient batch inference. Output is written as Dolma-style JSONL, which integrates easily into LLM training workflows.
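As a rough illustration of consuming that output, the sketch below reads a Dolma-style JSONL stream. It assumes each line is a JSON object with `id` and `text` fields, as in the usual Dolma document layout; olmOCR's actual records may carry additional fields. The names `iter_documents` and `total_characters` are hypothetical helpers, not part of the toolkit.

```python
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


def total_characters(stream: TextIO) -> int:
    """Sum the length of the assumed `text` field across all documents."""
    # Documents without a `text` field contribute zero characters.
    return sum(len(doc.get("text", "")) for doc in iter_documents(stream))
```

Streaming line by line keeps memory use flat, which matters when a single output file holds many documents.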

Quick Start & Requirements

  • Installation: pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
  • Prerequisites: NVIDIA GPU (>=20GB VRAM), 30GB disk space, poppler-utils, and Microsoft Core Fonts.
  • Setup: Requires creating a conda environment and installing the dependencies.
  • Demo: Online demo available at https://olmocr.allenai.org/.
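Before installing, it can help to confirm that a machine meets the prerequisites listed above. The following is a minimal sketch, not part of olmOCR: `check_prerequisites` and `gpu_vram_mib` are hypothetical helpers, the thresholds are approximations of the stated 20 GB and 30 GB figures, and poppler-utils is detected by looking for its `pdftoppm` binary. Microsoft Core Fonts are not checked.

```python
import shutil
import subprocess

# Approximate thresholds taken from the prerequisites list (assumed units).
MIN_VRAM_MIB = 20 * 1000        # roughly 20 GB of GPU memory, in MiB
MIN_DISK_BYTES = 30 * 1024**3   # 30 GB of free disk space


def gpu_vram_mib() -> list[int]:
    """Return total memory (MiB) per GPU reported by nvidia-smi, or []."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    # One integer per line, one line per GPU.
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def check_prerequisites(path: str = ".") -> dict[str, bool]:
    """Report which of the checkable prerequisites appear to be satisfied."""
    return {
        "gpu_vram": any(m >= MIN_VRAM_MIB for m in gpu_vram_mib()),
        "disk_space": shutil.disk_usage(path).free >= MIN_DISK_BYTES,
        "poppler_utils": shutil.which("pdftoppm") is not None,
    }
```

Calling `check_prerequisites()` returns a dictionary of booleans, so a failed check can be reported by name before the longer install step begins.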

Highlighted Details

  • Utilizes a GPT-4o prompting strategy for robust PDF text extraction.
  • Supports batch processing of millions of PDFs via S3 integration and cluster execution (e.g., Beaker).
  • Includes fine-tuning code for Qwen2-VL and Molmo-O models.
  • Offers a side-by-side evaluation toolkit for comparing pipeline versions.

Maintenance & Community

Developed and maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2).

Licensing & Compatibility

Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Inference requires a recent NVIDIA GPU with at least 20GB of VRAM. The accompanying arXiv paper is dated 2025, so the project is relatively new and likely still under active development.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 11
  • Issues (30d): 76

Star History

1,381 stars in the last 90 days
