llm-reader by m92vyas

Webpage pre-processing for LLM-driven data extraction

Created 1 year ago
270 stars

Top 95.4% on SourcePulse

Project Summary

This library prepares raw webpage content for use with Large Language Models (LLMs), particularly in Retrieval Augmented Generation (RAG) and AI web scraping pipelines. It is an open-source alternative to commercial services such as Firecrawl and the Jina Reader API: it extracts text, links, and structured data from web pages in a form intended to improve LLM accuracy and reduce costs. The project targets developers and researchers who want a cost-effective web data pre-processing step.

How It Works

The core functionality relies on two asynchronous functions: get_page_source, which fetches the raw HTML for a given URL, and get_processed_text, which transforms that HTML into clean, LLM-friendly text. The backend has moved from Selenium to Playwright, which enables asynchronous, concurrent scraping and therefore more efficient processing of multiple pages. The processed text is meant to be placed directly into LLM prompts for tasks such as data extraction and RAG.
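The two-stage flow described above (fetch, then process, with several pages handled concurrently) can be sketched roughly as follows. This is not the library's code: fake_get_page_source and sketch_get_processed_text are hypothetical stand-ins, the fetch step returns canned HTML instead of driving a browser, and the processing step is a deliberately small standard-library approximation that keeps visible text and inlines absolute link targets.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TextAndLinks(HTMLParser):
    """Collects visible text and appends absolute link targets after anchors."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.parts = []
        self._skip_depth = 0
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            self._href = urljoin(self.base_url, href) if href else None

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href:
            self.parts.append(f"({self._href})")
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.parts.append(text)


async def fake_get_page_source(url):
    """Hypothetical stand-in for the fetch step; returns canned HTML."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return ('<html><body><script>x=1</script><h1>Shop</h1>'
            '<a href="/item/1">Blue Mug</a> $12</body></html>')


async def sketch_get_processed_text(page_source, base_url):
    """Hypothetical stand-in for the processing step (assumed behavior)."""
    parser = _TextAndLinks(base_url)
    parser.feed(page_source)
    parser.close()
    return " ".join(parser.parts)


async def process_url(url):
    html = await fake_get_page_source(url)
    return await sketch_get_processed_text(html, url)


async def process_many(urls):
    # gather() runs the per-URL coroutines concurrently on one event loop.
    return await asyncio.gather(*(process_url(u) for u in urls))


if __name__ == "__main__":
    for text in asyncio.run(process_many(["https://example.com/shop"])):
        print(text)
```

Keeping fetch and processing as separate coroutines means the fetch step can be replaced without touching the text processing.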

Quick Start & Requirements

Highlighted Details

  • Provides an open-source alternative to paid services like Firecrawl and Jina Reader API.
  • Leverages Playwright for efficient, concurrent web scraping.
  • Demonstrates structured data extraction (product name, link, image, price) from e-commerce sites, suitable for RAG.
  • Suggests integrating with proxy services or pay-as-you-go scraping APIs for get_page_source to bypass blocking, while keeping the text processing free.
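
One way to read the last point is that the fetch step is swappable while the processing step stays local. The sketch below illustrates that idea under stated assumptions: direct_fetch, paid_api_fetch, BlockedError, and process_locally are all hypothetical names invented for illustration, and no real proxy or scraping API is called.

```python
import asyncio
import re
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[str]]


class BlockedError(Exception):
    """Hypothetical error a fetcher raises when a site refuses the request."""


async def direct_fetch(url: str) -> str:
    # Stand-in for a direct browser fetch; pretends one host blocks us.
    if "blocked.example" in url:
        raise BlockedError(url)
    return "<p>direct content</p>"


async def paid_api_fetch(url: str) -> str:
    # Stand-in for a proxy or pay-as-you-go scraping API call.
    return "<p>content via scraping API</p>"


def process_locally(html: str) -> str:
    # Placeholder for the free text-processing step: strips tags only.
    return re.sub(r"<[^>]+>", "", html).strip()


async def fetch_then_process(url: str, primary: Fetcher, fallback: Fetcher) -> str:
    # Try the cheap fetcher first; pay for the fallback only when blocked.
    try:
        html = await primary(url)
    except BlockedError:
        html = await fallback(url)
    return process_locally(html)


if __name__ == "__main__":
    print(asyncio.run(
        fetch_then_process("https://blocked.example/a", direct_fetch, paid_api_fetch)
    ))
```

With this shape, only blocked pages incur the cost of the external service, and the processing function is the same in both paths.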

Maintenance & Community

The project encourages community engagement through GitHub issues and feature requests, and welcomes sponsorship to support development. It also points to related tools, AI-web_scraper and ParseExtract. No dedicated community channels (e.g., Discord, Slack) or formal roadmap are listed.

Licensing & Compatibility

The project is released under the MIT License, which permits broad use, including commercial applications and integration into closed-source projects, with minimal restrictions.

Limitations & Caveats

The get_page_source function can still be blocked by websites unless it is paired with a proxy or an external scraping service. Users are responsible for keeping the processed text within the LLM's context window; the example usage does this by truncating the text manually. The example also requires an OpenAI API key and incurs the associated API costs.
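
As a rough illustration of manual truncation, the sketch below caps text at a character budget and backs off to the last whitespace so a word is not cut in half. The function name truncate_for_prompt and the use of characters as a proxy for tokens are assumptions made here; the project's example may truncate differently.

```python
def truncate_for_prompt(text: str, max_chars: int) -> str:
    """Cap text at max_chars, preferring to cut at a whitespace boundary."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    if last_space > 0:
        # Drop the partial trailing word.
        cut = cut[:last_space]
    return cut


if __name__ == "__main__":
    print(truncate_for_prompt("alpha beta gamma", 12))
```

A character budget only approximates a token limit, so it is worth leaving headroom for the prompt instructions and the model's reply.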

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 30 days
