parsera by raznem

Lightweight library for web scraping with LLMs

Created 1 year ago

1,254 stars

Top 31.4% on SourcePulse

Project Summary

Parsera is a lightweight Python library designed for web scraping using Large Language Models (LLMs). It simplifies the process of extracting structured data from websites, making it accessible to developers and researchers who need to gather information programmatically. The library offers a straightforward API for defining the data elements to be scraped and leverages LLMs to interpret and extract this information from web pages.

How It Works

Parsera uses an LLM to parse HTML content and extract specified elements. The core approach is to define a dictionary of desired data fields and their natural language descriptions. The library then sends the web page content along with these descriptions to an LLM, which identifies and extracts the relevant information and returns it as structured JSON. This allows flexible scraping without hand-written CSS selectors or XPath queries, and it tends to tolerate changes in page structure better than selector-based scrapers.
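The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Parsera's actual implementation: the helpers build_prompt and extract, the prompt wording, and the stand-in fake_llm function are all hypothetical, and the sample field names are only examples.

```python
import json
from typing import Callable


def build_prompt(elements: dict[str, str], html: str) -> str:
    """Combine field descriptions and page content into one prompt (hypothetical format)."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in elements.items())
    return (
        "Extract the following fields from the HTML below.\n"
        "Return a JSON list of objects with exactly these keys.\n"
        f"{field_lines}\n\nHTML:\n{html}"
    )


def extract(elements: dict[str, str], html: str, llm: Callable[[str], str]) -> list[dict]:
    """Send the prompt to an LLM callable and keep only the requested fields."""
    raw = llm(build_prompt(elements, html))
    records = json.loads(raw)
    # Drop any keys the model added that were not requested.
    return [{name: rec.get(name) for name in elements} for rec in records]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned JSON answer."""
    return json.dumps([{"Title": "Example post", "Points": "42", "extra": "ignored"}])


elements = {"Title": "News title", "Points": "Number of points"}
html = "<tr><td class='title'>Example post</td><td>42 points</td></tr>"
result = extract(elements, html, fake_llm)
print(result)
```

Passing the model in as a plain callable keeps the sketch runnable offline; a real setup would replace fake_llm with a call to an actual LLM provider.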

Quick Start & Requirements

Install via pip: pip install parsera
Requires Playwright: playwright install
Requires the PARSERA_API_KEY environment variable (or OPENAI_API_KEY for the CLI).
Documentation: https://github.com/raznem/parsera#documentation

Highlighted Details

Simple, declarative interface for defining scrape targets.
Supports both synchronous (run) and asynchronous (arun) execution.
CLI and Docker options available for integration and deployment.
Can be configured to use custom LLMs.
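As a rough illustration of how a synchronous run and an asynchronous arun can coexist, the sketch below wraps an async method with asyncio.run. The SketchScraper class and its placeholder return values are hypothetical and only mirror the method names mentioned above; they are not Parsera's code.

```python
import asyncio


class SketchScraper:
    """Hypothetical scraper exposing both an async and a sync entry point."""

    async def arun(self, url: str, elements: dict[str, str]) -> list[dict]:
        # Stands in for loading the page and calling an LLM.
        await asyncio.sleep(0)
        return [{name: f"<{name} from {url}>" for name in elements}]

    def run(self, url: str, elements: dict[str, str]) -> list[dict]:
        # Synchronous convenience wrapper: drive the coroutine to completion.
        return asyncio.run(self.arun(url, elements))


scraper = SketchScraper()
elements = {"Title": "News title"}

# Synchronous use, e.g. from a plain script.
sync_result = scraper.run("https://example.com", elements)


# Asynchronous use, e.g. from inside an existing event loop.
async def main() -> list[dict]:
    return await scraper.arun("https://example.com", elements)


async_result = asyncio.run(main())
print(sync_result == async_result)
```

The async variant is the one to reach for when scraping runs inside code that already has an event loop, since asyncio.run cannot be called from a running loop.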

Maintenance & Community

The project appears to be maintained by a single author, @raznem. The README provides no explicit links to community channels or a roadmap.

Licensing & Compatibility

The README does not explicitly state a license. This is a significant omission for evaluating commercial use or integration with closed-source projects.

Limitations & Caveats

Because extraction relies on an LLM, results may vary between runs, and each scrape incurs API call costs. The absence of a specified license is a major blocker for commercial adoption.
