docling  by docling-project

Prepare documents for generative AI

Created 1 year ago
38,918 stars

Top 0.8% on SourcePulse

Project Summary

Docling prepares documents for generative AI applications. It parses a wide array of formats, with particularly deep PDF understanding, converts them into a unified representation, and integrates with popular AI frameworks. It can run entirely locally, which suits workflows that handle sensitive data.

How It Works

Docling parses document types such as PDF, DOCX, PPTX, images, and audio, with a focus on advanced PDF analysis covering layout, tables, and OCR. Each input is converted into a unified DoclingDocument representation, which can then be exported to various formats. Because parsing can run locally, sensitive documents need not leave the machine, and plug-and-play integrations connect the output to AI ecosystems such as LangChain and LlamaIndex.
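
The parse-then-export flow described above can be sketched as follows. This is a minimal sketch, assuming the `DocumentConverter` class and the `export_to_markdown()` method as shown in the Docling documentation; the `markdown_path` helper, the `out` directory, and the file names are hypothetical and exist only for illustration.

```python
from pathlib import Path


def markdown_path(source: str, out_dir: str = "out") -> Path:
    """Hypothetical helper: map a local input file name to a Markdown output path."""
    return Path(out_dir) / (Path(source).stem + ".md")


def convert_to_markdown(source: str, out_dir: str = "out") -> Path:
    """Parse one local document and write its Markdown export to out_dir."""
    # Imported lazily so markdown_path() can be used without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # result.document is the DoclingDocument
    target = markdown_path(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target
```

Calling `convert_to_markdown("report.pdf")` on a real local file (the name here is a placeholder) would be expected to write `out/report.md`. The first run may take longer, since Docling typically downloads its models before parsing.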

Quick Start & Requirements

  • Install: pip install docling
  • Prerequisites: Works on macOS, Linux, and Windows (x86_64, arm64). MLX acceleration is available for SmolDocling on Apple Silicon.
  • Resources: No specific hardware requirements mentioned for basic use.
  • Links: Documentation, Examples, Integrations, Technical Report.

Highlighted Details

  • Parses diverse formats: PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images, and more.
  • Advanced PDF understanding: layout, reading order, tables, OCR, image classification.
  • Supports Visual Language Models (VLMs) like SmolDocling, leveraging MLX acceleration on Apple Silicon.
  • Integrates with LangChain, LlamaIndex, Crew AI, and Haystack, and provides an MCP server for connecting to agents.
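
To illustrate the multi-format support listed above, here is a hedged sketch that filters a batch of files by extension and stores each parsed document as JSON. It assumes `DocumentConverter.convert()` accepts a file path and that `export_to_dict()` returns a JSON-serializable dictionary, as in the Docling examples. The `SUPPORTED_SUFFIXES` set and both function names are hypothetical; the set is an illustrative subset of the formats above, and audio and image inputs are left out on the assumption that they may need extra pipeline configuration.

```python
import json
from pathlib import Path

# Illustrative subset of the formats listed above, not Docling's full list.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def select_supported(paths):
    """Hypothetical helper: keep paths whose suffix is in SUPPORTED_SUFFIXES."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


def export_all_as_json(paths, out_dir="out"):
    """Convert each supported file and write its DoclingDocument as JSON."""
    # Imported lazily so select_supported() works without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in select_supported(paths):
        document = converter.convert(path).document
        target = out / (path.stem + ".json")
        target.write_text(json.dumps(document.export_to_dict()), encoding="utf-8")
        written.append(target)
    return written
```

Keeping the JSON form preserves the unified document structure, so a later step can re-export it or hand it to a downstream framework without parsing the source file again.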

Maintenance & Community

Hosted by the LF AI & Data Foundation, the project originated from IBM Research Zurich. Community support and discussions are available via the project's discussion section.

Licensing & Compatibility

The Docling codebase is licensed under the MIT license. Individual models used within Docling may have their own licenses, which require separate review for commercial use or closed-source linking.

Limitations & Caveats

Structured information extraction is currently in beta. Metadata extraction, chart understanding, and complex chemistry understanding are listed as "coming soon" and are not yet available.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 47
  • Issues (30d): 115
  • Star history: 2,712 stars in the last 30 days
