thepipe  by emcf

SDK for extracting data from documents

Created 1 year ago
1,300 stars

Top 30.7% on SourcePulse

1 Expert Loves This Project
Project Summary

This package provides a unified interface for extracting clean, structured data from a wide variety of complex document formats, including PDFs, web pages, presentations, and multimedia files. It leverages Vision-Language Models (VLMs) for enhanced extraction quality and is designed for seamless integration with LLMs, vector databases, and RAG frameworks, targeting developers and data scientists working with diverse document sources.

How It Works

Thepipe combines computer vision models with heuristics to scrape content robustly. It performs AI-native filetype detection, layout analysis, and structured data extraction. For multimodal sources such as web pages, videos, and audio, it uses Whisper for transcription, frame extraction for video, and Playwright for web scraping. The extracted content can then be split into chunks (by document, page, length, or section, or semantically) for consumption by downstream LLM or RAG pipelines.
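To make the chunking step concrete, here is a minimal, library-independent sketch of two of the strategies named above: chunking by length and chunking by section. It does not use or reproduce thepipe's own chunkers; the function names chunk_by_length and chunk_by_section and the max_chars parameter are hypothetical, and the section splitter assumes the extracted text is markdown with "#" headings.

```python
import re
from typing import List


def chunk_by_length(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole words into chunks of up to max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # The next word would overflow: close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_by_section(markdown: str) -> List[str]:
    """Split markdown so that each chunk starts at a heading line."""
    # Zero-width split at the start of any line that begins with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    doc = "# Intro\nSome text.\n## Details\nMore text."
    for section in chunk_by_section(doc):
        print(repr(section))
    print(chunk_by_length("a b c d", max_chars=3))
```

Length-based chunks are predictable in size but can cut across topics, while section-based chunks follow the document's own structure at the cost of uneven sizes. Which one suits a RAG pipeline depends on the embedding model's context window and how well-structured the source documents are.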

Quick Start & Requirements

  • Install via pip: pip install thepipe-api
  • For media-rich sources, install the extra dependencies: run apt-get update && apt-get install -y git ffmpeg, then python -m playwright install --with-deps chromium.
  • Requires an OpenAI API key (or custom LLM server configuration via LLM_SERVER_BASE_URL and LLM_SERVER_API_KEY).
  • See: Official Docs
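
The credential requirement above can be illustrated with a small sketch of how a script might choose between a custom LLM server and OpenAI. This is not thepipe's own configuration code. LLM_SERVER_BASE_URL and LLM_SERVER_API_KEY come from the listing above; the OPENAI_API_KEY variable name, the resolve_llm_settings helper, and the fallback order are assumptions made for illustration.

```python
import os
from typing import Dict, Mapping, Optional

# Public base URL of the OpenAI API, used when no custom server is configured.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick LLM endpoint settings from environment variables (hypothetical helper).

    Assumed precedence: a custom server (LLM_SERVER_BASE_URL) wins over OpenAI.
    """
    env = os.environ if env is None else env
    base_url = env.get("LLM_SERVER_BASE_URL")
    if base_url:
        # Custom server: the key may be empty, e.g. for a local endpoint.
        return {"base_url": base_url, "api_key": env.get("LLM_SERVER_API_KEY", "")}

    # Assumption: the OpenAI key is supplied via the conventional OPENAI_API_KEY.
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError(
            "Set OPENAI_API_KEY, or LLM_SERVER_BASE_URL and LLM_SERVER_API_KEY."
        )
    return {"base_url": OPENAI_DEFAULT_BASE_URL, "api_key": api_key}


if __name__ == "__main__":
    try:
        print(resolve_llm_settings()["base_url"])
    except RuntimeError as err:
        print(f"Configuration error: {err}")
```

Checking credentials once at startup gives a clear error before any document is processed, rather than a failed request partway through a long extraction run.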

Highlighted Details

  • Supports a broad range of input types including URLs, PDFs, DOCX, PPTX, MP4, MP3, Jupyter Notebooks, CSV, TXT, images, ZIP files, directories, YouTube videos, Tweets, and GitHub repositories.
  • Offers multiple chunking strategies, including experimental semantic and agentic chunking.
  • Provides structured data extraction capabilities using a defined schema.
  • Can be configured to host images locally for persistent storage.
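
As a rough illustration of schema-driven extraction, the sketch below coerces a raw record (for example, the parsed JSON an LLM might return) into the fields and types declared by a schema. The schema format (field name mapped to a type name), the supported type names, and the coerce_to_schema helper are all hypothetical; thepipe's actual schema format and extraction call are described in its official docs and are not reproduced here.

```python
from typing import Any, Callable, Dict, Mapping

# Hypothetical type names mapped to Python constructors used for coercion.
CASTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "float": float,
}


def coerce_to_schema(record: Mapping[str, Any], schema: Mapping[str, str]) -> Dict[str, Any]:
    """Keep only the schema's fields and cast each present value to its declared type.

    Fields missing from the record are set to None; extra fields are dropped.
    """
    result: Dict[str, Any] = {}
    for field, type_name in schema.items():
        if type_name not in CASTERS:
            raise ValueError(f"Unsupported type {type_name!r} for field {field!r}")
        value = record.get(field)
        result[field] = None if value is None else CASTERS[type_name](value)
    return result


if __name__ == "__main__":
    schema = {"title": "string", "total": "float", "pages": "int"}
    raw = {"title": "Invoice", "total": "12.50", "unrelated": 1}
    print(coerce_to_schema(raw, schema))
```

Normalizing extracted records against a fixed schema keeps downstream code, such as a vector database loader, from having to handle missing keys or string-typed numbers.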

Maintenance & Community

The project is seeking sponsors to support its maintenance and development. Links to Cal.com and Trellis AI are provided as examples of supported open-source projects.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project notes that YouTube video scraping relies on unofficial APIs, so it can break and may require user agent modifications. Tweet scraping also uses unofficial APIs and is similarly prone to breaking. The semantic and agentic chunking methods are marked as experimental.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

10 stars in the last 30 days

Explore Similar Projects

Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

trafilatura by adbar

0.5%
5k
Python package for web text extraction
Created 6 years ago
Updated 6 days ago