thepipe by emcf

SDK for extracting data from documents

Created 1 year ago

1,521 stars

Top 26.8% on SourcePulse

1 Expert Loves This Project

Dharmesh Shah

Cofounder of HubSpot

Project Summary

This package provides a unified interface for extracting clean, structured data from a wide variety of complex document formats, including PDFs, web pages, presentations, and multimedia files. It leverages Vision-Language Models (VLMs) for enhanced extraction quality and is designed for seamless integration with LLMs, vector databases, and RAG frameworks, targeting developers and data scientists working with diverse document sources.

How It Works

Thepipe combines computer vision models with heuristics to scrape content robustly. It performs AI-native filetype detection, layout analysis, and structured data extraction. For multimodal sources such as web pages, videos, and audio, it uses Whisper for transcription, frame extraction for video, and Playwright for web scraping. The extracted content can then be split into chunks (by document, page, length, or section, or semantically) for consumption by a downstream LLM or RAG pipeline.
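To make two of the chunking strategies named above concrete, here is a minimal sketch of length-based and section-based chunking over plain Markdown-style text. It is not thepipe's implementation: the function names chunk_by_length and chunk_by_section, the max_chars parameter, and the paragraph-packing rule are all assumptions chosen for illustration.

```python
import re
from typing import List


def chunk_by_length(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    Hypothetical helper: a single paragraph longer than max_chars is
    kept whole rather than split mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_by_section(text: str) -> List[str]:
    """Split text so that each chunk starts at a Markdown-style heading."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    doc = "# Intro\nHello world.\n\n# Usage\nRun it.\n\nMore text."
    for chunk in chunk_by_section(doc):
        print(repr(chunk))
```

Packing whole paragraphs keeps sentences intact at the cost of uneven chunk sizes; a semantic chunker would instead group text by embedding similarity, which this sketch does not attempt.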

Quick Start & Requirements

Install via pip: pip install thepipe-api
For media-rich sources, first install the system dependencies with apt-get update && apt-get install -y git ffmpeg, then install the Playwright browser with python -m playwright install --with-deps chromium.
Requires an OpenAI API key (or custom LLM server configuration via LLM_SERVER_BASE_URL and LLM_SERVER_API_KEY).
See: Official Docs
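The credential requirement above can be checked before running any extraction. The sketch below resolves an LLM endpoint from environment variables; LLM_SERVER_BASE_URL and LLM_SERVER_API_KEY come from the summary, while the helper name resolve_llm_config, the fallback to OPENAI_API_KEY, and the precedence order are assumptions and may not match how thepipe itself reads its configuration.

```python
import os
from typing import Dict, Mapping, Optional

# Public OpenAI endpoint, used when no custom server is configured.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the base URL and API key to use (hypothetical helper).

    Assumed precedence: custom server variables first, then the
    standard OPENAI_API_KEY variable.
    """
    env = os.environ if env is None else env
    base_url = env.get("LLM_SERVER_BASE_URL") or OPENAI_DEFAULT_BASE_URL
    api_key = env.get("LLM_SERVER_API_KEY") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "No API key found: set OPENAI_API_KEY or LLM_SERVER_API_KEY."
        )
    return {"base_url": base_url, "api_key": api_key}


if __name__ == "__main__":
    # Example with an explicit mapping instead of the real environment.
    print(resolve_llm_config({"OPENAI_API_KEY": "sk-example"}))
```

Passing the environment as a mapping keeps the helper easy to test and avoids mutating os.environ.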

Highlighted Details

Supports a broad range of input types including URLs, PDFs, DOCX, PPTX, MP4, MP3, Jupyter Notebooks, CSV, TXT, images, ZIP files, directories, YouTube videos, Tweets, and GitHub repositories.
Offers multiple chunking strategies, including experimental semantic and agentic chunking.
Provides structured data extraction capabilities using a defined schema.
Can be configured to host images locally for persistent storage.
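As an illustration of what extraction against a defined schema involves, the sketch below validates a model's JSON output against a simple field-to-type mapping. This is not thepipe's API: the SCHEMA dictionary, its field names, and the validate_extraction helper are hypothetical, and the real library may describe and enforce schemas differently.

```python
import json
from typing import Any, Dict

# Hypothetical schema: field name -> expected Python type.
SCHEMA: Dict[str, type] = {"title": str, "year": int, "authors": list}


def validate_extraction(raw_json: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse raw JSON and keep only schema fields, checking their types."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    result: Dict[str, Any] = {}
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the schema really asks for a bool.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise ValueError(f"field {field!r} is not of type {expected.__name__}")
        result[field] = value
    return result


if __name__ == "__main__":
    raw = '{"title": "Example", "year": 2024, "authors": ["A. Author"], "extra": 1}'
    print(validate_extraction(raw, SCHEMA))
```

Fields that are not in the schema are dropped silently here; a stricter variant could reject them instead.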

Maintenance & Community

The project is seeking sponsors to support its maintenance and development. Links to Cal.com and Trellis AI are provided as examples of supported open-source projects.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project notes that YouTube video scraping relies on an unofficial API, so it can break and may require changing the user agent. Tweet scraping also uses unofficial APIs and is similarly prone to breaking. The semantic and agentic chunking methods are marked as experimental.

Health Check

Last Commit

4 months ago

Responsiveness

1 week

Star History

4 stars in the last 30 days