parsera by raznem

Lightweight library for web scraping with LLMs

created 11 months ago
1,130 stars

Top 34.6% on sourcepulse

Project Summary

Parsera is a lightweight Python library for scraping websites with Large Language Models (LLMs). It targets developers and researchers who need to extract structured data programmatically: the caller declares the data elements to scrape through a small API, and an LLM interprets the page and extracts them.

How It Works

Parsera uses an LLM to parse HTML and extract the requested elements. The caller defines a dictionary that maps each desired field to a natural-language description. The library sends the page content together with these descriptions to the LLM, which locates the matching information and returns it as structured JSON. Because no CSS selectors or XPath queries are written by hand, the scraper tolerates changes in page structure better than selector-based approaches.
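The flow described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Parsera's implementation: build_prompt, extract, and fake_llm are hypothetical names, the prompt wording is an assumption, and fake_llm returns a canned answer in place of a real model call.

```python
import json
from typing import Callable, Dict, List, Optional


def build_prompt(html: str, elements: Dict[str, str]) -> str:
    """Combine the page content and the field descriptions into one prompt."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in elements.items())
    return (
        "Extract the following fields from the HTML below and answer "
        "with a JSON list of objects, one per item.\n"
        f"Fields:\n{field_lines}\n\nHTML:\n{html}"
    )


def extract(
    html: str, elements: Dict[str, str], llm: Callable[[str], str]
) -> List[Dict[str, Optional[str]]]:
    """Send the prompt to the model and parse its JSON reply."""
    reply = llm(build_prompt(html, elements))
    rows = json.loads(reply)
    # Keep only the requested fields so stray keys from the model are dropped.
    return [{name: row.get(name) for name in elements} for row in rows]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return '[{"title": "Example post", "points": "42", "extra": "ignored"}]'


# Field names mapped to natural-language descriptions.
elements = {"title": "Headline of the post", "points": "Number of upvotes"}
html = "<li><a>Example post</a><span>42 points</span></li>"

result = extract(html, elements, fake_llm)
print(result)
```

Swapping fake_llm for a function that calls a real model would leave the rest of the sketch unchanged, which is the main appeal of describing fields in natural language rather than as selectors.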

Quick Start & Requirements

Highlighted Details

  • Simple, declarative interface for defining scrape targets.
  • Supports both synchronous (run) and asynchronous (arun) execution.
  • CLI and Docker options available for integration and deployment.
  • Can be configured to use custom LLMs.
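One common way to offer both a synchronous and an asynchronous entry point is to implement the coroutine once and wrap it for blocking callers. The sketch below illustrates that pattern under stated assumptions: SketchScraper is a hypothetical class, its method signatures are guesses modelled on the run and arun names above, and the body returns placeholder values instead of fetching a page or calling a model.

```python
import asyncio
from typing import Dict, List


class SketchScraper:
    """Hypothetical scraper used for illustration; not Parsera's actual class."""

    async def arun(self, url: str, elements: Dict[str, str]) -> List[Dict[str, str]]:
        # Placeholder for the real work (fetch the page, call the LLM).
        await asyncio.sleep(0)
        return [{name: f"<{name} from {url}>" for name in elements}]

    def run(self, url: str, elements: Dict[str, str]) -> List[Dict[str, str]]:
        # Synchronous wrapper: drive the coroutine to completion.
        return asyncio.run(self.arun(url, elements))


scraper = SketchScraper()
elements = {"title": "Headline of the post"}

# Blocking call, suitable for scripts.
sync_result = scraper.run("https://example.com", elements)


async def main() -> List[Dict[str, str]]:
    # Awaitable call, suitable for code that already runs in an event loop.
    return await scraper.arun("https://example.com", elements)


async_result = asyncio.run(main())
print(sync_result == async_result)
```

Note that asyncio.run raises an error when called from inside a running event loop (for example in a notebook), so in that setting the awaitable variant is the one to use.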

Maintenance & Community

The project appears to be maintained by a single author, @raznem. The README links to no community channels or roadmap.

Licensing & Compatibility

The README does not explicitly state a license. This is a significant omission for evaluating commercial use or integration with closed-source projects.

Limitations & Caveats

Because parsing relies on an LLM, results may vary, and calls to a model API may incur costs. The absence of a stated license is a major blocker for commercial adoption.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

66 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

1.9%
44k
API service for turning websites into LLM-ready data
created 1 year ago
updated 1 day ago