llm-reader by m92vyas

Webpage pre-processing for LLM-driven data extraction

Created 1 year ago

280 stars

Top 93.1% on SourcePulse

Project Summary

This library prepares raw webpage content for use with Large Language Models (LLMs), particularly in Retrieval-Augmented Generation (RAG) and AI web scraping pipelines. It offers an open-source alternative to commercial services like Firecrawl and Jina Reader API, simplifying the extraction of text, links, and structured data from web pages to improve LLM accuracy and reduce costs. The project targets developers and researchers seeking robust, cost-effective web data pre-processing solutions.

How It Works

The core functionality relies on two asynchronous functions: get_page_source to fetch raw HTML content from a given URL, and get_processed_text to transform this HTML into a clean, LLM-friendly text format. The project has transitioned from Selenium to Playwright for its backend, enabling asynchronous and concurrent web scraping capabilities. This shift enhances performance and scalability, allowing for more efficient processing of multiple web pages. The processed text is optimized for direct input into LLM prompts, facilitating tasks like data extraction and RAG.
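
The fetch-then-process flow described above lends itself to handling several URLs concurrently. The following is a minimal sketch of that pattern, not the library's own code: fake_get_page_source and fake_get_processed_text are hypothetical stand-ins defined here so the example runs without a browser or network access. In real use they would be replaced by the library's get_page_source and get_processed_text, whose exact import paths and signatures should be checked against the project documentation.

```python
import asyncio


async def fake_get_page_source(url: str) -> str:
    """Hypothetical stand-in for get_page_source: returns canned HTML."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return f"<html><body><a href='{url}/item'>Item</a></body></html>"


async def fake_get_processed_text(page_source: str, base_url: str) -> str:
    """Hypothetical stand-in for get_processed_text: returns a short summary string."""
    await asyncio.sleep(0)
    return f"[source: {base_url}] {len(page_source)} characters of HTML processed"


async def url_to_llm_text(url: str) -> str:
    # Step 1: fetch the raw HTML. Step 2: turn it into LLM-friendly text.
    page_source = await fake_get_page_source(url)
    return await fake_get_processed_text(page_source, url)


async def process_many(urls: list[str]) -> list[str]:
    # asyncio.gather runs the per-URL coroutines concurrently and
    # returns the results in the same order as the input URLs.
    return await asyncio.gather(*(url_to_llm_text(u) for u in urls))


if __name__ == "__main__":
    example_urls = ["https://example.com/a", "https://example.com/b"]
    for text in asyncio.run(process_many(example_urls)):
        print(text)
```

Because each page is handled by its own coroutine, a slow page does not block the others. For large URL lists, an asyncio.Semaphore could be added to cap the number of pages open at once.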

Quick Start & Requirements

Installation:

pip install git+https://github.com/m92vyas/llm-reader.git
playwright install
playwright install-deps

Prerequisites: Python, Playwright browser binaries (installed via playwright install).
Documentation: https://github.com/m92vyas/llm-reader/wiki/Documentation

Highlighted Details

Provides an open-source alternative to paid services like Firecrawl and Jina Reader API.
Leverages Playwright for efficient, concurrent web scraping.
Demonstrates structured data extraction (product name, link, image, price) from e-commerce sites, suitable for RAG.
Suggests integrating with proxy services or pay-as-you-go scraping APIs for get_page_source to bypass blocking, while keeping the text processing free.
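
The last point, swapping the fetch step while keeping the text processing unchanged, can be sketched by passing the fetcher in as a parameter. Everything below is illustrative: direct_fetch, scraping_api_fetch, and process_html are hypothetical stand-ins that return canned strings, not the library's functions or any real scraping provider's API.

```python
import asyncio
from typing import Awaitable, Callable

# Any async function that maps a URL to an HTML string can act as a fetcher.
Fetcher = Callable[[str], Awaitable[str]]


async def direct_fetch(url: str) -> str:
    """Hypothetical stand-in for a direct browser-based fetch."""
    await asyncio.sleep(0)
    return f"<html><body>direct:{url}</body></html>"


async def scraping_api_fetch(url: str) -> str:
    """Hypothetical stand-in for a fetch routed through a proxy or paid scraping API."""
    await asyncio.sleep(0)
    return f"<html><body>via-api:{url}</body></html>"


async def process_html(page_source: str, base_url: str) -> str:
    """Hypothetical stand-in for the free text-processing step (base_url unused here)."""
    await asyncio.sleep(0)
    return page_source.replace("<html><body>", "").replace("</body></html>", "")


async def fetch_and_process(url: str, fetcher: Fetcher = direct_fetch) -> str:
    # Only the fetch step varies; the processing step is the same for every fetcher.
    page_source = await fetcher(url)
    return await process_html(page_source, url)


if __name__ == "__main__":
    print(asyncio.run(fetch_and_process("https://example.com")))
    print(asyncio.run(fetch_and_process("https://example.com", scraping_api_fetch)))
```

Keeping the fetcher behind a single callable type means a blocked site can be retried through a paid route without touching the rest of the pipeline.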

Maintenance & Community

The project encourages community engagement through GitHub issues and feature requests. Sponsorship is welcomed to support development. Related tools like AI-web_scraper and ParseExtract are also highlighted. No dedicated community channels (e.g., Discord, Slack) or formal roadmap are listed.

Licensing & Compatibility

The project is released under the MIT License, which permits broad use, including commercial applications and integration into closed-source projects, with minimal restrictions.

Limitations & Caveats

The get_page_source function may still be blocked by websites unless it is used with appropriate proxying or an external scraping service. Users are responsible for managing LLM context window limits, as demonstrated by manual text truncation in the example usage. The example also requires an OpenAI API key and incurs the associated API costs.
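
One simple way to handle the context-window caveat is to cap the processed text before building the prompt. The sketch below is an assumption-laden illustration rather than the project's approach: truncate_for_context is a hypothetical helper, and the default of four characters per token is only a rough heuristic for English text. An exact count would require the target model's own tokenizer.

```python
def truncate_for_context(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Trim text to roughly max_tokens, using an assumed characters-per-token ratio."""
    if max_tokens <= 0:
        return ""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text  # already within budget
    cut = text[:max_chars]
    # Back up to the last space inside the cut so the result does not end
    # mid-word or mid-URL (this may drop one whole word).
    last_space = cut.rfind(" ")
    return cut[:last_space] if last_space > 0 else cut


if __name__ == "__main__":
    sample = "alpha beta gamma delta"
    print(truncate_for_context(sample, max_tokens=3))
```

Truncating from the end assumes the most relevant content appears early on the page. When that does not hold, splitting the text into chunks and querying each chunk separately may work better.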
