e2m by wisupai

Python library for converting files to Markdown, targeting RAG/model training

Created 1 year ago

1,247 stars

Top 31.5% on SourcePulse

Project Summary

E2M is a Python library designed to convert a wide array of file types into Markdown format, primarily targeting users involved in Retrieval-Augmented Generation (RAG) and model training. It offers a flexible, all-in-one solution for data preparation by abstracting the complexities of parsing and converting diverse content.

How It Works

E2M employs a parser-converter architecture. Parsers extract raw text and image data from the supported inputs (PDF, DOC, DOCX, EPUB, HTML, URL, PPT, PPTX, MP3, M4A) using engines such as marker, unstructured, pandoc, and openai_whisper. Converters then transform the extracted data into Markdown; both text and image conversion use engines such as litellm. The aim is high-quality, structured data suitable for AI applications.
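To make the two-stage idea concrete, here is a minimal, self-contained sketch of a parser-converter pipeline. It does not use E2M's actual API: `ParsedData`, `PARSERS`, `parse`, `to_markdown`, and `file_to_markdown` are hypothetical names chosen for illustration, and the parsers are stand-ins that return placeholder text instead of calling a real engine.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class ParsedData:
    """Raw parser output: plain text plus paths of any extracted images."""
    text: str
    images: List[str] = field(default_factory=list)


# Hypothetical stand-in parsers. A real parser would call an engine
# such as marker, pandoc, or a speech-to-text model here.
def parse_html(path: str) -> ParsedData:
    return ParsedData(text=f"raw text extracted from {path}")


def parse_audio(path: str) -> ParsedData:
    return ParsedData(text=f"transcript of {path}")


# Stage 1: choose a parser by file extension.
PARSERS: Dict[str, Callable[[str], ParsedData]] = {
    ".html": parse_html,
    ".mp3": parse_audio,
    ".m4a": parse_audio,
}


def parse(path: str) -> ParsedData:
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"no parser registered for {suffix!r}")
    return PARSERS[suffix](path)


# Stage 2: turn parsed data into Markdown. A real converter might
# send the text to an LLM; this one only adds a title and image links.
def to_markdown(data: ParsedData, title: str) -> str:
    lines = [f"# {title}", "", data.text]
    lines += [f"![]({img})" for img in data.images]
    return "\n".join(lines)


def file_to_markdown(path: str) -> str:
    return to_markdown(parse(path), title=Path(path).stem)
```

Keeping the two stages separate means a parser can be swapped per file type without touching the Markdown step, which is the flexibility the architecture description points to.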

Quick Start & Requirements

Installation: pip install git+https://github.com/wisupai/e2m.git or pip install wisup_e2m.
Prerequisites: Python 3.10 is recommended. Some parsers/converters may require additional dependencies (e.g., pandoc, unstructured, openai-whisper). API keys are needed for cloud-based services like OpenAI Whisper API or litellm providers.
API Service: Start with gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000. API docs available at http://127.0.0.1:8000/docs.
Documentation: API Documentation
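
Once the API service is started with the gunicorn command above, a quick way to confirm it is listening is to request the docs page. The sketch below uses only the Python standard library; the helper names `docs_url` and `is_reachable` are hypothetical, and the default host and port simply mirror the address given above.

```python
import urllib.error
import urllib.request


def docs_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the docs URL for a server bound to the given host and port."""
    return f"http://{host}:{port}/docs"


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

Calling `is_reachable(docs_url())` returns `False` rather than raising when the server is not running, so it can be used in a simple startup wait loop.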

Highlighted Details

Supports conversion of 12 file types including documents, presentations, and audio.
Offers multiple engine options for each file type, allowing for flexibility and customization.
Includes integrated E2MParser and E2MConverter classes that can be configured via YAML for streamlined workflows.
Voice parsing leverages OpenAI Whisper (API and local) for audio-to-text conversion.
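
As an illustration of configuration-driven engine selection, the sketch below maps file types to engines from a config dictionary. The schema is an assumption made for this example and is not E2M's documented YAML format; `CONFIG` and `engine_for` are hypothetical names, and the dictionary stands in for what a YAML loader would return.

```python
from pathlib import Path
from typing import Dict

# Hypothetical config, written as the dict a YAML loader would produce for:
#   parsers:
#     pdf:  {engine: marker}
#     docx: {engine: pandoc}
#     mp3:  {engine: openai_whisper}
CONFIG: Dict[str, Dict[str, Dict[str, str]]] = {
    "parsers": {
        "pdf": {"engine": "marker"},
        "docx": {"engine": "pandoc"},
        "mp3": {"engine": "openai_whisper"},
    }
}


def engine_for(path: str, config: Dict[str, Dict[str, Dict[str, str]]] = CONFIG) -> str:
    """Look up the configured engine name for a file, keyed by its extension."""
    file_type = Path(path).suffix.lower().lstrip(".")
    try:
        return config["parsers"][file_type]["engine"]
    except KeyError:
        raise ValueError(f"no engine configured for file type {file_type!r}") from None
```

Putting the engine choice in configuration rather than code lets one workflow switch engines per file type without edits to the calling script.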

Maintenance & Community

The project is developed by Wisup, an AI startup focused on data and algorithms.
Contact for inquiries: team@wisup.ai. GitHub issues are also monitored.

Licensing & Compatibility

Licensed under the MIT License. This permits commercial use and integration with closed-source projects.

Limitations & Caveats

The README rates the Image Converter's image recognition via litellm and zhipuai as "Not Well" and "Not Recommended."

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

6 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App).

chatgpt-md-translator by smikitky

CLI tool for translating Markdown documents using the ChatGPT API

Created 2 years ago

Updated 9 months ago

ingest by sammcj

Markdown generator for LLM ingestion

Created 1 year ago

Updated 1 month ago

Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

Created 9 months ago

Updated 7 months ago

inkdown by 1943time

WYSIWYG editor for enhanced Markdown workflows

Created 2 years ago

Updated 3 weeks ago

vision-parse by iamarunbrahma

CLI tool for parsing PDFs into markdown using vision LLMs

Created 1 year ago

Updated 3 months ago

markdownify-mcp by zcaceres

MCP server for converting files/web content to Markdown

Created 1 year ago

Updated 4 months ago

Starred by Michael Chiang (Cofounder of Ollama).

Ollama-OCR by imanoop7

OCR package for extracting text from images/PDFs using vision language models via Ollama

Created 1 year ago

Updated 10 months ago

gptpdf by CosmosShadow

CLI tool for parsing PDFs into Markdown using GPT models

Created 1 year ago

Updated 8 months ago

Starred by Tobi Lutke (Cofounder of Shopify).

omniparse by adithya-s-k

Data ingestion/parsing platform for GenAI

Created 1 year ago

Updated 1 month ago

Starred by Lyumin Zhang (Author of ControlNet).

manga-image-translator by zyddnys

Image translator for manga/images, supporting multiple languages

Created 4 years ago

Updated 3 weeks ago

Starred by Vincent Weisser (Cofounder of Prime Intellect).

yn by purocean

Extensible Markdown editor for productivity

Created 8 years ago

Updated 2 weeks ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 20 more.

markitdown by microsoft

Python tool for converting files to Markdown for LLM text analysis

Created 1 year ago

Updated 2 days ago
