docling by docling-project

Prepare documents for generative AI

Created 2 years ago

62,980 stars

Top 0.5% on SourcePulse

18 Experts Love This Project

Carol Willing: Core Contributor to CPython, Jupyter
Jeff Hammerbacher: Cofounder of Cloudera
Tim J. Baek: Founder of Open WebUI
Li Jiang: Coauthor of AutoGen; Engineer at Microsoft

and 14 more!

Project Summary

Docling parses a wide range of document formats, with advanced PDF understanding, into a unified representation for generative AI applications. It integrates with popular AI frameworks and can run locally, which suits workflows that handle sensitive data.

How It Works

Docling parses document types such as PDF, DOCX, PPTX, images, and audio, with particular depth on PDFs: layout analysis, table structure, and OCR. Every input is converted into a single DoclingDocument representation, which can then be exported to various formats. Processing can run locally for privacy, and plug-and-play integrations are available for AI ecosystems such as LangChain and LlamaIndex.
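
The convert-then-export flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference usage: it assumes `docling` is installed, the `fmt` labels ("markdown", "dict") are this sketch's own names, and the import is done lazily only so the helpers can be defined on machines without Docling.

```python
from typing import Any


def convert(source: str) -> Any:
    """Convert a local path or URL into a DoclingDocument."""
    # Imported lazily so this module loads even where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return converter.convert(source).document


def export(document: Any, fmt: str = "markdown") -> Any:
    """Export a converted document; the `fmt` labels are this sketch's own."""
    if fmt == "markdown":
        return document.export_to_markdown()
    if fmt == "dict":
        return document.export_to_dict()
    raise ValueError(f"unsupported format: {fmt!r}")
```

With Docling installed, `print(export(convert("report.pdf")))` would print a Markdown rendering, where `report.pdf` is a hypothetical local file.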

Quick Start & Requirements

Install: pip install docling
Prerequisites: Works on macOS, Linux, and Windows (x86_64, arm64). MLX acceleration is available for SmolDocling on Apple Silicon.
Resources: No specific hardware requirements mentioned for basic use.
Links: Documentation, Examples, Integrations, Technical Report.
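
After running the install command above, one way to confirm the package is visible to the current interpreter is to query its installed version with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical and not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "docling") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"docling {found}" if found else "docling is not installed in this environment")
```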

Highlighted Details

Parses diverse formats: PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images, and more.
Advanced PDF understanding: layout, reading order, tables, OCR, image classification.
Supports Visual Language Models (VLMs) like SmolDocling, leveraging MLX acceleration on Apple Silicon.
Integrates with LangChain, LlamaIndex, Crew AI, and Haystack, and provides an MCP server for connecting to agents.
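
When batch-processing a folder, it can help to pre-filter files by extension before handing them to a converter. The sketch below does that with the standard library only. The suffix set is an assumption that mirrors the formats named in the list above (the image suffixes are illustrative); it is not Docling's authoritative list of supported inputs.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: suffixes chosen to mirror the formats named above; the image
# suffixes are illustrative, and this is not Docling's authoritative list.
CANDIDATE_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".wav", ".mp3",
    ".png", ".jpg", ".jpeg",
}


def candidate_files(paths: Iterable[str]) -> List[Path]:
    """Keep only paths whose suffix (compared case-insensitively) is in the set."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in CANDIDATE_SUFFIXES]
```

Matching on the lowercased suffix keeps files such as `SCAN.PDF`, while files without an extension are skipped.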

Maintenance & Community

Hosted by the LF AI & Data Foundation, the project originated from IBM Research Zurich. Community support and discussions are available via the project's discussion section.

Licensing & Compatibility

The Docling codebase is licensed under the MIT license. Individual models used within Docling may have their own licenses, which require separate review for commercial use or closed-source linking.

Limitations & Caveats

Structured information extraction is currently in beta. Metadata extraction, chart understanding, and complex chemistry understanding are listed as "coming soon" and are not yet available.

Health Check

Last Commit: 23 hours ago
Responsiveness: Inactive
Pull Requests (30d): 137

Issues (30d)

Star History: 1,697 stars in the last 30 days