site2pdf by laiso

Website to PDF converter for RAG

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

Project Summary

site2pdf renders an entire website into a single PDF, intended as input for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks. It targets users who need to consolidate web content into a portable format that preserves the pages' visual appearance.

How It Works

The tool uses Puppeteer to navigate a website and identify sub-links that match a user-supplied URL pattern (when no pattern is given, it defaults to links under the main domain). It then uses pdf-lib to generate and merge the individual PDFs for each page into a single document. This approach preserves visual information and produces one unified dataset suitable for multimodal AI models.
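The link-selection step described above can be sketched without a browser. This is a minimal illustration, not the tool's actual code: the function names (escapeRegExp, buildLinkPattern, selectSubLinks) are hypothetical, the example URLs are made up, and the default of "links that start with the main URL" is an assumption about how the fallback behaves.

```typescript
// Hypothetical helpers for illustration; not taken from the site2pdf source.

// Escape regex metacharacters so a URL can be matched as a literal prefix.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Use the caller's pattern if one is given; otherwise (assumed default)
// accept any link that starts with the main URL.
function buildLinkPattern(mainUrl: string, urlPattern?: string): RegExp {
  return urlPattern
    ? new RegExp(urlPattern)
    : new RegExp("^" + escapeRegExp(mainUrl));
}

// Keep links that match the pattern, dropping duplicates and preserving order.
function selectSubLinks(links: string[], pattern: RegExp): string[] {
  const selected: string[] = [];
  for (const link of links) {
    if (pattern.test(link) && selected.indexOf(link) === -1) {
      selected.push(link);
    }
  }
  return selected;
}

// Links as they might be collected from the anchors on a page.
const discovered: string[] = [
  "https://example.com/docs/intro",
  "https://example.com/docs/api",
  "https://example.com/blog/post",
  "https://other.example.org/docs/intro",
  "https://example.com/docs/intro",
];

// No pattern supplied: everything under the main URL is kept (3 links).
const sameSite = selectSubLinks(discovered, buildLinkPattern("https://example.com/"));

// Pattern supplied: only the /docs/ pages are kept (2 links).
const docsOnly = selectSubLinks(
  discovered,
  buildLinkPattern("https://example.com/", "^https://example\\.com/docs/")
);
```

In a full pipeline, each selected URL would then be rendered to a PDF and the results merged. Escaping the main URL before building the default pattern keeps characters such as "." from being read as regex wildcards.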

Quick Start & Requirements

  • Primary install/run command: npx site2pdf-cli <main_url> [url_pattern]
  • Prerequisites: Node.js. Linux dependencies include libxkbcommon0, libnss3, libxss1, libasound2, fonts-liberation, libappindicator3-1, libatk-bridge2.0-0, libatspi2.0-0, libgtk-3-0, libgbm-dev.
  • Windows troubleshooting involves granting specific permissions via icacls.
  • Official docs/demo: Not explicitly linked, but usage examples are provided.

Highlighted Details

  • Generates a single PDF from multiple website pages.
  • Preserves visual information (images) for multimodal AI.
  • Supports filtering sub-links with regular expressions.
  • Outputs PDF with a slugified filename based on the main URL.
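
The slugified filename mentioned in the last item could be derived along these lines. This is a sketch under assumptions: slugifyUrl and outputFileName are hypothetical names, and the exact slug rules used by site2pdf (which characters are replaced, and whether the scheme is dropped) may differ.

```typescript
// Hypothetical helper; site2pdf's actual slug rules may differ.
function slugifyUrl(url: string): string {
  return url
    .replace(/^https?:\/\//i, "") // drop the scheme
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of other characters into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Build the output filename from the main URL.
function outputFileName(mainUrl: string): string {
  return slugifyUrl(mainUrl) + ".pdf";
}

const fileName = outputFileName("https://Example.com/docs/getting-started/");
// fileName is "example-com-docs-getting-started.pdf"
```

Restricting the slug to lowercase letters, digits, and hyphens keeps the filename valid on Linux, macOS, and Windows.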

Maintenance & Community

  • The project is open for contributions via issues and pull requests.
  • No specific community channels (Discord/Slack) or roadmap are mentioned.

Licensing & Compatibility

  • The README does not specify a license.

Limitations & Caveats

The README describes the tool as "still under development", so limitations should be expected. Because no license is specified, the terms for use in commercial or closed-source applications are unclear. Windows users may need to resolve permission issues for Puppeteer, as noted under Quick Start & Requirements.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 4 stars in the last 30 days
