llm-reader by m92vyas
Webpage pre-processing for LLM-driven data extraction
Top 95.4% on SourcePulse
This library addresses the challenge of preparing raw webpage content for effective use with Large Language Models (LLMs), particularly in Retrieval-Augmented Generation (RAG) and AI web scraping pipelines. It offers an open-source alternative to commercial services such as Firecrawl and the Jina Reader API, simplifying the extraction of text, links, and structured data from web pages to improve LLM accuracy and reduce costs. The project targets developers and researchers seeking robust, cost-effective web data pre-processing solutions.
How It Works
The core functionality relies on two asynchronous functions: get_page_source to fetch raw HTML content from a given URL, and get_processed_text to transform this HTML into a clean, LLM-friendly text format. The project has transitioned from Selenium to Playwright for its backend, enabling asynchronous and concurrent web scraping capabilities. This shift enhances performance and scalability, allowing for more efficient processing of multiple web pages. The processed text is optimized for direct input into LLM prompts, facilitating tasks like data extraction and RAG.
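As a rough illustration of the fetch-then-process flow described above, the following Python sketch uses only the standard library. It is not llm-reader's implementation: fetch_html, html_to_text, process_many, and SAMPLE_PAGES are hypothetical names, the fetch step is a stub that reads from an in-memory dictionary instead of driving Playwright, and the text cleaning is deliberately simplistic.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TextAndLinks(HTMLParser):
    """Collect visible text and absolute link targets, skipping script/style."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.parts = []
        self._skip_depth = 0
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href:
            # Emit the resolved link right after the anchor text.
            self.parts.append(f"({urljoin(self.base_url, self._href)})")
            self._href = None

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


async def fetch_html(url, pages):
    """Hypothetical stand-in for a browser fetch: looks the URL up in a dict."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return pages[url]


async def html_to_text(html, base_url):
    """Turn raw HTML into a single line of text with inline absolute links."""
    parser = _TextAndLinks(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


async def process_many(urls, pages):
    """Fetch and process several pages concurrently, preserving input order."""
    async def one(url):
        return await html_to_text(await fetch_html(url, pages), url)

    return await asyncio.gather(*(one(u) for u in urls))


SAMPLE_PAGES = {
    "https://example.com/a": (
        "<html><head><style>p{color:red}</style></head><body>"
        "<p>Hello <a href='/docs'>docs</a></p>"
        "<script>var x=1;</script></body></html>"
    ),
    "https://example.com/b": "<p>Second   page</p>",
}

if __name__ == "__main__":
    for text in asyncio.run(process_many(list(SAMPLE_PAGES), SAMPLE_PAGES)):
        print(text)
```

The use of asyncio.gather is what allows several pages to be in flight at once; with a real browser backend, the awaited fetch is where the event loop would switch between pages.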
Quick Start & Requirements
pip install git+https://github.com/m92vyas/llm-reader.git
playwright install
playwright install-deps
Highlighted Details
get_page_source to bypass blocking, while keeping the text processing free.
Maintenance & Community
The project encourages community engagement through GitHub issues and feature requests. Sponsorship is welcomed to support development. Related tools like AI-web_scraper and ParseExtract are also highlighted. No community channels (e.g., Discord, Slack) or formal roadmap are listed.
Licensing & Compatibility
The project is released under the MIT License, which permits broad use, including commercial applications and integration into closed-source projects, with minimal restrictions.
Limitations & Caveats
The get_page_source function may still be susceptible to website blocking if not used with appropriate proxying or external scraping services. Users are responsible for managing LLM context window limits, as demonstrated by manual text truncation in the example usage. The example also requires an OpenAI API key and incurs the associated API costs.
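One simple way to handle the context window caveat is to cap the processed text at a character budget before building the prompt. The sketch below is an assumption about how such truncation might be done, not the code from the project's example: truncate_for_context and its parameters are hypothetical, and a character count is only a rough proxy for the token limit a given model enforces.

```python
def truncate_for_context(text, max_chars, marker=" [truncated]"):
    """Return text cut to at most max_chars characters, ending on a word boundary."""
    if max_chars < len(marker):
        raise ValueError("max_chars must be at least as long as the marker")
    if len(text) <= max_chars:
        return text

    budget = max_chars - len(marker)
    cut = text[:budget]
    # If the cut lands mid-word, back up to the previous space.
    if not text[budget].isspace():
        last_space = cut.rfind(" ")
        if last_space > 0:
            cut = cut[:last_space]
    return cut.rstrip() + marker


if __name__ == "__main__":
    print(truncate_for_context("alpha beta gamma delta", 20))
```

The marker tells the model (and anyone reading logs) that the page was cut, which can help avoid treating the truncated text as the complete page.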
1 month ago
Inactive