e2m by wisupai

Python library for converting files to Markdown, targeting RAG/model training

Created 1 year ago
1,231 stars

Top 32.0% on SourcePulse

Project Summary

E2M is a Python library designed to convert a wide array of file types into Markdown format, primarily targeting users involved in Retrieval-Augmented Generation (RAG) and model training. It offers a flexible, all-in-one solution for data preparation by abstracting the complexities of parsing and converting diverse content.

How It Works

E2M uses a parser-converter architecture. Parsers extract raw text and image data from the supported inputs (PDF, DOC, DOCX, EPUB, HTML, URL, PPT, PPTX, MP3, M4A) using engines such as marker, unstructured, pandoc, and openai_whisper. Converters then transform the extracted data into Markdown. Both text conversion and image conversion support litellm as an engine. The goal is high-quality, structured data suitable for AI applications.
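The two-stage flow described above can be illustrated with a small, self-contained sketch. It does not use E2M's own classes: `ParsedData`, `parse_plain_text`, `convert_to_markdown`, and `to_markdown` are hypothetical names chosen for illustration, and the converter here is a trivial stand-in for an LLM-backed engine such as litellm.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParsedData:
    """Raw parser output: extracted text plus image paths (hypothetical type)."""
    text: str
    images: list[str] = field(default_factory=list)


def parse_plain_text(raw: str) -> ParsedData:
    """Stand-in parser: treats the input as already-extracted text."""
    return ParsedData(text=raw.strip())


def convert_to_markdown(data: ParsedData, title: str) -> str:
    """Stand-in converter: a real one would delegate to an engine instead."""
    paragraphs = [p.strip() for p in data.text.split("\n\n") if p.strip()]
    body = "\n\n".join(paragraphs)
    return f"# {title}\n\n{body}\n"


def to_markdown(raw: str, title: str,
                parser: Callable[[str], ParsedData] = parse_plain_text) -> str:
    """Run the two stages in order: parse first, then convert."""
    return convert_to_markdown(parser(raw), title)


if __name__ == "__main__":
    print(to_markdown("First paragraph.\n\nSecond paragraph.", "Example"))
```

Keeping the parser as a swappable argument mirrors the idea that one pipeline can sit on top of several extraction engines.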

Quick Start & Requirements

  • Installation: pip install git+https://github.com/wisupai/e2m.git or pip install wisup_e2m.
  • Prerequisites: Python 3.10 is recommended. Some parsers/converters may require additional dependencies (e.g., pandoc, unstructured, openai-whisper). API keys are needed for cloud-based services like OpenAI Whisper API or litellm providers.
  • API Service: Start with gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000. API docs available at http://127.0.0.1:8000/docs.
  • Documentation: API Documentation
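Because some engines depend on optional packages or external tools, it can help to check what is available before choosing an engine. The sketch below uses only the Python standard library. The helper name `check_optional_dependencies` is hypothetical, and the module and binary names in the usage example (`unstructured`, `whisper`, `pandoc`) are assumptions about how these dependencies are exposed on a given system.

```python
import importlib.util
import shutil


def check_optional_dependencies(modules, binaries):
    """Report which top-level Python modules and command-line tools are available."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[f"binary:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    # Assumed names; adjust to the engines you intend to use.
    status = check_optional_dependencies(["unstructured", "whisper"], ["pandoc"])
    for key, available in status.items():
        print(f"{key}: {'found' if available else 'missing'}")
```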

Highlighted Details

  • Supports conversion of 12 file types including documents, presentations, and audio.
  • Offers multiple engine options for each file type, allowing for flexibility and customization.
  • Includes integrated E2MParser and E2MConverter classes that can be configured via YAML for streamlined workflows.
  • Voice parsing leverages OpenAI Whisper (API and local) for audio-to-text conversion.
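To make the idea of configuration-driven engine selection concrete, here is a minimal sketch in which a plain dict stands in for a YAML file. It is not E2M's configuration schema: the key names, the `select_engines` helper, and the pairing of file extensions with engines are all assumptions for illustration.

```python
from pathlib import Path

# Hypothetical configuration. In a YAML file the same structure would be
# written as nested mappings. Keys and pairings are assumptions.
CONFIG = {
    "parsers": {
        ".pdf": {"engine": "marker"},
        ".docx": {"engine": "pandoc"},
        ".mp3": {"engine": "openai_whisper"},
    },
    "converter": {"engine": "litellm"},
}


def select_engines(file_path: str, config: dict) -> tuple[str, str]:
    """Return (parser_engine, converter_engine) based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        parser_engine = config["parsers"][suffix]["engine"]
    except KeyError:
        raise ValueError(f"No parser configured for {suffix!r}") from None
    return parser_engine, config["converter"]["engine"]


if __name__ == "__main__":
    print(select_engines("lecture.mp3", CONFIG))
```

Failing loudly on an unconfigured extension keeps a batch job from silently skipping files.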

Maintenance & Community

  • The project is developed by Wisup, an AI startup focused on data and algorithms.
  • Contact for inquiries: team@wisup.ai. GitHub issues are also monitored.

Licensing & Compatibility

  • Licensed under the MIT License. This permits commercial use and integration with closed-source projects.

Limitations & Caveats

  • The README rates the Image Converter's image recognition via litellm and zhipuai as "Not Well" and "Not Recommended."

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 17 stars in the last 30 days
