wisupai
Python library for converting files to Markdown, targeting RAG/model training
Top 31.9% on SourcePulse
E2M is a Python library designed to convert a wide array of file types into Markdown format, primarily targeting users involved in Retrieval-Augmented Generation (RAG) and model training. It offers a flexible, all-in-one solution for data preparation by abstracting the complexities of parsing and converting diverse content.
How It Works
E2M employs a parser-converter architecture. Parsers extract raw text and image data from a range of inputs (PDF, DOC, DOCX, EPUB, HTML, URL, PPT, PPTX, MP3, M4A), using engines such as marker, unstructured, pandoc, and openai_whisper. Converters then transform the extracted data into Markdown. Text conversion uses engines such as litellm, and image conversion supports litellm as well. The goal is high-quality, structured data suitable for AI applications.
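The two-stage split described above can be illustrated with a minimal, standard-library-only sketch. None of the names here (ParsedData, PARSERS, parse_html, convert_to_markdown, to_markdown) come from E2M; they are hypothetical stand-ins, and the rule-based converter only takes the place of an LLM-backed engine such as litellm.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List


@dataclass
class ParsedData:
    """Output of the parser stage: extracted text plus image references."""
    text: str
    images: List[str] = field(default_factory=list)


class _TextAndImages(HTMLParser):
    """Collects visible text chunks and <img> sources from HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self.images: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_html(raw: str) -> ParsedData:
    extractor = _TextAndImages()
    extractor.feed(raw)
    extractor.close()
    return ParsedData(text="\n".join(extractor.chunks), images=extractor.images)


def parse_txt(raw: str) -> ParsedData:
    return ParsedData(text=raw.strip())


# Hypothetical registry: input type -> parser function (stage 1).
PARSERS: Dict[str, Callable[[str], ParsedData]] = {
    "html": parse_html,
    "txt": parse_txt,
}


def convert_to_markdown(data: ParsedData) -> str:
    """Stage 2: a trivial rule-based converter (first line becomes a heading)."""
    lines = [ln for ln in data.text.splitlines() if ln.strip()]
    if not lines:
        return ""
    parts = [f"# {lines[0]}"] + lines[1:]
    parts += [f"![image]({src})" for src in data.images]
    return "\n\n".join(parts)


def to_markdown(raw: str, file_type: str) -> str:
    """Run the parser for the given type, then the converter."""
    return convert_to_markdown(PARSERS[file_type](raw))


if __name__ == "__main__":
    print(to_markdown('<h1>Title</h1><p>Body text</p><img src="a.png">', "html"))
```

Keeping the parser output in one shared structure means a new input type only needs a new parser entry; the converter stage stays unchanged.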
Quick Start & Requirements
Install: pip install git+https://github.com/wisupai/e2m.git or pip install wisup_e2m
Dependencies: engine packages (pandoc, unstructured, openai-whisper). API keys are needed for cloud-based services like OpenAI Whisper API or litellm providers.
API server: gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
API docs are available at http://127.0.0.1:8000/docs.
Highlighted Details
E2MParser and E2MConverter classes can be configured via YAML for streamlined workflows.
Maintenance & Community
Contact: team@wisup.ai. GitHub issues are also monitored.
Licensing & Compatibility
Limitations & Caveats
litellm and zhipuai are "Not Well" and "Not Recommended."
1 year ago
Inactive