e2m by wisupai

Python library for converting files to Markdown, targeting RAG/model training

created 1 year ago
1,193 stars

Top 33.5% on sourcepulse

Project Summary

E2M is a Python library designed to convert a wide array of file types into Markdown format, primarily targeting users involved in Retrieval-Augmented Generation (RAG) and model training. It offers a flexible, all-in-one solution for data preparation by abstracting the complexities of parsing and converting diverse content.

How It Works

E2M employs a parser-converter architecture. Parsers extract raw text and image data from the supported inputs (PDF, DOC, DOCX, EPUB, HTML, URL, PPT, PPTX, MP3, M4A) using engines such as marker, unstructured, pandoc, and openai_whisper. Converters then turn the extracted data into Markdown. Both text and image conversion use LLM-backed engines such as litellm, with the aim of producing high-quality, structured data suitable for AI applications.
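The two-stage split described above can be illustrated with a minimal, self-contained sketch. It does not use E2M's own classes: ParsedData, parse_plain_text, convert_to_markdown, PARSERS, and to_markdown are hypothetical names chosen for illustration, and a simple rule-based converter stands in for the LLM-backed engines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ParsedData:
    """Raw parser output: plain text plus paths of any extracted images."""
    text: str
    images: List[str] = field(default_factory=list)


def parse_plain_text(raw: str) -> ParsedData:
    """Toy parser: normalizes line endings and strips trailing whitespace."""
    lines = [line.rstrip() for line in raw.replace("\r\n", "\n").split("\n")]
    return ParsedData(text="\n".join(lines))


def convert_to_markdown(data: ParsedData) -> str:
    """Toy converter: short ALL-CAPS lines become headings, the rest passes through."""
    out = []
    for line in data.text.split("\n"):
        if line and line.isupper() and len(line) <= 40:
            out.append("## " + line.title())
        else:
            out.append(line)
    return "\n".join(out)


# Registry keyed by file extension, standing in for per-format parser engines.
PARSERS: Dict[str, Callable[[str], ParsedData]] = {".txt": parse_plain_text}


def to_markdown(raw: str, extension: str) -> str:
    """Run the parse stage, then the convert stage."""
    parser = PARSERS[extension]
    return convert_to_markdown(parser(raw))
```

Keeping the stages separate means a new input format only needs a new entry in the parser registry, while the conversion step stays unchanged.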

Quick Start & Requirements

  • Installation: pip install git+https://github.com/wisupai/e2m.git or pip install wisup_e2m.
  • Prerequisites: Python 3.10 is recommended. Some parsers/converters may require additional dependencies (e.g., pandoc, unstructured, openai-whisper). API keys are needed for cloud-based services like OpenAI Whisper API or litellm providers.
  • API Service: Start with gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000. API docs available at http://127.0.0.1:8000/docs.
  • Documentation: API Documentation
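For scripted deployments, the API service command above can be assembled programmatically. The sketch below only rebuilds the gunicorn invocation quoted in this section; the helper name gunicorn_command and its parameters are hypothetical, and it assumes gunicorn, uvicorn, and wisup_e2m are already installed.

```python
import shlex
from typing import List


def gunicorn_command(workers: int = 4, host: str = "0.0.0.0", port: int = 8000) -> List[str]:
    """Build the argv list for serving the E2M API app with gunicorn."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [
        "gunicorn",
        "wisup_e2m.api.main:app",
        "--workers", str(workers),
        "--worker-class", "uvicorn.workers.UvicornWorker",
        "--bind", f"{host}:{port}",
    ]


# Render the default command as a single shell-safe string for inspection.
DEFAULT_COMMAND = shlex.join(gunicorn_command())
```

The returned list can be passed to subprocess.run to start the server; printing DEFAULT_COMMAND first is a cheap way to confirm the flags before launching.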

Highlighted Details

  • Supports conversion of 12 file types including documents, presentations, and audio.
  • Offers multiple engine options for each file type, allowing for flexibility and customization.
  • Includes integrated E2MParser and E2MConverter classes that can be configured via YAML for streamlined workflows.
  • Voice parsing leverages OpenAI Whisper (API and local) for audio-to-text conversion.

Maintenance & Community

  • The project is developed by Wisup, an AI startup focused on data and algorithms.
  • Contact for inquiries: team@wisup.ai. GitHub issues are also monitored.

Licensing & Compatibility

  • Licensed under the MIT License. This permits commercial use and integration with closed-source projects.

Limitations & Caveats

  • The README notes that Image Converter's image recognition capabilities via litellm and zhipuai are "Not Well" and "Not Recommended."

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 144 stars in the last 90 days
