markitdown by microsoft

Python tool for converting files to Markdown for LLM text analysis

created 8 months ago
69,896 stars

Top 0.2% on sourcepulse

Project Summary

MarkItDown is a Python utility designed to convert a wide array of file formats, including Office documents, PDFs, images, and audio, into Markdown. It targets developers and researchers working with Large Language Models (LLMs) and text analysis pipelines, aiming to preserve document structure for better LLM comprehension and token efficiency.

How It Works

MarkItDown uses a separate optional dependency group for each file type, so users install only what they need. For complex documents such as PDFs or Office files, it aims to retain structural elements (headings, lists, tables) in the Markdown output. It can also extract metadata, run OCR on images, and transcribe audio, and it can optionally call an LLM to describe image content.
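As a rough illustration of the conversion flow described above, here is a minimal sketch. It assumes the `MarkItDown` class, its `convert()` method, and the `text_content` attribute on the result, as shown in the project's README; the helper name `to_markdown` and its `converter` parameter are hypothetical and exist only for this example.

```python
from pathlib import Path


def to_markdown(path, converter=None):
    """Convert one file to Markdown text.

    `to_markdown` and `converter` are illustrative names, not part of
    MarkItDown itself.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised with a stand-in
        # converter when markitdown is not installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    # convert() takes a path (or URL) and returns a result object whose
    # text_content attribute holds the Markdown string.
    result = converter.convert(str(Path(path)))
    return result.text_content
```

Calling `to_markdown("report.docx")` would need the matching optional dependency group installed (for example `markitdown[docx]`). Passing an existing converter lets several files share one configured instance.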

Quick Start & Requirements

  • Install with: pip install 'markitdown[all]'
  • Optional dependencies can be installed individually (e.g., pip install 'markitdown[pdf, docx]').
  • Supports integration with Azure Document Intelligence and LLM clients (e.g., OpenAI) for advanced features.
  • See GitHub repository for full details.

Highlighted Details

  • Supports conversion of PDF, PowerPoint, Word, Excel, images (OCR/EXIF), audio (transcription/EXIF), HTML, ZIP, YouTube URLs, and EPUBs.
  • Offers an MCP server for integration with LLM applications like Claude Desktop.
  • Extensible via a plugin system, with a sample plugin available for development.
  • Can utilize Azure Document Intelligence for enhanced document conversion.

Maintenance & Community

This project is maintained by Microsoft. Contributions are welcomed, with specific issues and PRs marked for community involvement. It follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The move from v0.0.1 to v0.1.0 introduced breaking changes: dependencies must be updated, and the DocumentConverter class now reads from file-like streams rather than file paths. The Markdown output is optimized for LLM consumption, so it may not be suitable when a high-fidelity, human-readable conversion is required.
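To make the stream-based handling concrete, the sketch below converts in-memory bytes without touching the filesystem. It assumes the `convert_stream()` method of `MarkItDown` accepts a binary file-like object such as `io.BytesIO`, as the project's README describes; the helper name `bytes_to_markdown` and its `converter` parameter are hypothetical. Some inputs may need an additional format hint, so check the project documentation before relying on content-based detection.

```python
import io


def bytes_to_markdown(data, converter=None):
    """Convert raw bytes to Markdown via a binary stream.

    `bytes_to_markdown` and `converter` are illustrative names, not part
    of MarkItDown itself.
    """
    if converter is None:
        # Lazy import keeps the helper usable with a stand-in converter.
        from markitdown import MarkItDown

        converter = MarkItDown()
    # The stream must be binary; a text-mode stream (io.StringIO) is the
    # kind of input the v0.1.0 change no longer accepts.
    stream = io.BytesIO(data)
    result = converter.convert_stream(stream)
    return result.text_content
```

Code written against v0.0.1 that passed text-mode streams would need to switch to a file opened with `"rb"` or to an `io.BytesIO` wrapper like the one above.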

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 39
  • Issues (30d): 30

Star History

15,064 stars in the last 90 days
