markitdown by microsoft

Python tool for converting files to Markdown for LLM text analysis

Created 1 year ago

85,099 stars

Top 0.1% on SourcePulse

22 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Elvis Saravia

Founder of DAIR.AI

Han Wang

Cofounder of Mintlify

Gregor Zunic

Cofounder of Browser Use

and 18 more!

Project Summary

MarkItDown is a Python utility designed to convert a wide array of file formats, including Office documents, PDFs, images, and audio, into Markdown. It targets developers and researchers working with Large Language Models (LLMs) and text analysis pipelines, aiming to preserve document structure for better LLM comprehension and token efficiency.

How It Works

MarkItDown handles each file type through its own optional dependency group, so users install only the converters they need. For complex documents such as PDFs or Office files, it aims to retain structural elements such as headings, lists, and tables in the Markdown output. For images and audio it extracts metadata and performs OCR or speech transcription, respectively, and it can call an LLM client to enrich the extracted content.
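A minimal sketch of the conversion flow described above, assuming the `markitdown` package is installed and that its `MarkItDown().convert(path)` call returns a result with a `text_content` attribute, as the project's README describes. The helper name `convert_to_markdown` and its `converter` parameter are hypothetical, introduced here only so the function can be exercised with a stand-in object.

```python
from typing import Any, Optional


def convert_to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert the file at `path` to Markdown text.

    `converter` is any object exposing `convert(path)` whose result has a
    `text_content` attribute. When omitted, a default MarkItDown instance
    is created, which requires the `markitdown` package.
    """
    if converter is None:
        # Imported lazily so the helper can be tested without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(path)
    return result.text_content


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python convert.py report.docx
    print(convert_to_markdown(sys.argv[1]))
```

Which formats succeed depends on which optional dependency groups were installed alongside the package.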

Quick Start & Requirements

Install with: pip install 'markitdown[all]'
Optional dependencies can be installed individually (e.g., pip install 'markitdown[pdf, docx]').
Supports integration with Azure Document Intelligence and LLM clients (e.g., OpenAI) for advanced features.
See GitHub repository for full details.
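The integrations mentioned above are configured through constructor arguments. The sketch below assumes the keyword names `llm_client`, `llm_model`, and `docintel_endpoint` shown in the project's README; the helper functions `converter_options` and `build_converter` are hypothetical, and the rule that a model name must accompany a client is a design choice of this sketch, not documented library behavior.

```python
from typing import Any, Dict, Optional


def converter_options(
    llm_client: Optional[Any] = None,
    llm_model: Optional[str] = None,
    docintel_endpoint: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect only the integration options that were actually supplied."""
    options: Dict[str, Any] = {}
    if llm_client is not None:
        # Assumption of this sketch: a client is only useful with a model name.
        if not llm_model:
            raise ValueError("llm_model is required when llm_client is set")
        options["llm_client"] = llm_client
        options["llm_model"] = llm_model
    if docintel_endpoint:
        options["docintel_endpoint"] = docintel_endpoint
    return options


def build_converter(**kwargs: Any) -> Any:
    """Create a MarkItDown instance with the selected integrations."""
    # Imported lazily; requires the `markitdown` package.
    from markitdown import MarkItDown

    return MarkItDown(**converter_options(**kwargs))
```

For example, `build_converter(llm_client=client, llm_model="gpt-4o")` would pass an already-constructed OpenAI client through, while `build_converter()` yields a plain converter with no integrations.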

Highlighted Details

Supports conversion of PDF, PowerPoint, Word, Excel, images (OCR/EXIF), audio (transcription/EXIF), HTML, ZIP, YouTube URLs, and EPUBs.
Offers an MCP server for integration with LLM applications like Claude Desktop.
Extensible via a plugin system, with a sample plugin available for development.
Can utilize Azure Document Intelligence for enhanced document conversion.

Maintenance & Community

This project is maintained by Microsoft. Contributions are welcomed, with specific issues and PRs marked for community involvement. It follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The 0.0.1 to 0.1.0 release introduced breaking changes: dependencies are now organized into optional feature groups, convert_stream() requires a binary file-like object, and the DocumentConverter class interface reads from file-like streams rather than file paths. The Markdown output is optimized for LLM consumption and may not suit high-fidelity, human-readable document conversion.
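To illustrate the stream-based handling, here is a hedged sketch that passes a binary file-like object to `convert_stream()`, the method name used in the project's README. The wrapper `convert_binary_stream` and its early text-stream check are hypothetical additions for illustration; the library may report a text stream differently.

```python
import io
from typing import Any, BinaryIO


def convert_binary_stream(converter: Any, stream: BinaryIO) -> str:
    """Convert an already-open binary stream to Markdown text.

    `converter` is expected to behave like a MarkItDown instance, that is,
    to expose `convert_stream(stream)` returning an object with a
    `text_content` attribute.
    """
    # Reject text-mode streams early, e.g. open(path, "r") or io.StringIO.
    if isinstance(stream, io.TextIOBase):
        raise TypeError(
            "expected a binary stream, e.g. open(path, 'rb') or io.BytesIO"
        )
    return converter.convert_stream(stream).text_content
```

With the real package, a call might look like `convert_binary_stream(MarkItDown(), open("report.pdf", "rb"))`, where the file is opened in binary mode rather than text mode.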

Health Check

Last Commit

2 days ago

Responsiveness

1 day

Star History

1,104 stars in the last 30 days