Python library for converting files to Markdown, targeting RAG/model training
E2M is a Python library designed to convert a wide array of file types into Markdown format, primarily targeting users involved in Retrieval-Augmented Generation (RAG) and model training. It offers a flexible, all-in-one solution for data preparation by abstracting the complexities of parsing and converting diverse content.
How It Works
E2M employs a parser-converter architecture. Parsers extract raw text and image data from various file formats (PDF, DOC, DOCX, EPUB, HTML, URL, PPT, PPTX, MP3, M4A), leveraging engines such as marker, unstructured, pandoc, and openai_whisper. Converters then transform the extracted data into Markdown. Both text and image conversion use engines such as litellm, with the aim of producing high-quality, structured data suitable for AI applications.
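The two-stage split described above can be illustrated with a minimal, self-contained sketch. It does not use E2M's own API: every name here (ParsedData, parse_txt, parse_html, PARSERS, convert_to_markdown, to_markdown) is hypothetical, and the converter is a simple rule-based stand-in for the LLM-backed engines that E2M relies on.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import Callable, Dict, List


@dataclass
class ParsedData:
    """Hypothetical container for what a parser extracts."""
    text: str
    images: List[str] = field(default_factory=list)


def parse_txt(raw: str) -> ParsedData:
    """Toy parser: plain text needs no extraction beyond trimming."""
    return ParsedData(text=raw.strip())


def parse_html(raw: str) -> ParsedData:
    """Toy parser: collect image sources, then strip tags and squeeze whitespace."""
    images = re.findall(r'<img[^>]*\bsrc="([^"]+)"', raw)
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return ParsedData(text=text, images=images)


# Stage 1 registry: file extension -> parser (assumed mapping, for illustration only).
PARSERS: Dict[str, Callable[[str], ParsedData]] = {
    ".txt": parse_txt,
    ".html": parse_html,
}


def convert_to_markdown(data: ParsedData, title: str) -> str:
    """Stage 2: turn extracted text and image references into Markdown."""
    lines = [f"# {title}", "", data.text]
    for index, src in enumerate(data.images, start=1):
        lines += ["", f"![image {index}]({src})"]
    return "\n".join(lines)


def to_markdown(filename: str, raw: str) -> str:
    """Pick a parser by extension, then hand its output to the converter."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"no parser registered for {suffix!r}")
    return convert_to_markdown(PARSERS[suffix](raw), title=filename)


if __name__ == "__main__":
    print(to_markdown("page.html", '<p>Hello <b>world</b></p><img src="a.png">'))
```

Keeping extraction and conversion behind separate interfaces is what lets the engine for either stage be swapped without touching the other.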
Quick Start & Requirements
Install with pip install git+https://github.com/wisupai/e2m.git or pip install wisup_e2m.
Additional dependencies may be required (pandoc, unstructured, openai-whisper). API keys are needed for cloud-based services like the OpenAI Whisper API or litellm providers.
Start the API service with: gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
API docs are then available at http://127.0.0.1:8000/docs.
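As a quick sanity check after starting the service, the docs URL can be probed from Python. This is a minimal sketch using only the standard library. The helper name docs_reachable is hypothetical, and it assumes the docs page answers a plain GET with HTTP 200.

```python
import urllib.error
import urllib.request


def docs_reachable(url: str = "http://127.0.0.1:8000/docs", timeout: float = 2.0) -> bool:
    """Return True if a GET to the docs URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as "not up".
        return False


if __name__ == "__main__":
    print("API docs reachable:", docs_reachable())
```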
Highlighted Details
Provides E2MParser and E2MConverter classes that can be configured via YAML for streamlined workflows.
Maintenance & Community
Contact: team@wisup.ai. GitHub issues are also monitored.
Licensing & Compatibility
Limitations & Caveats
litellm and zhipuai are "Not Well" and "Not Recommended."