parsera by raznem

Lightweight library for web scraping with LLMs

Created 1 year ago
1,216 stars

Top 32.2% on SourcePulse

Project Summary

Parsera is a lightweight Python library for web scraping with Large Language Models (LLMs). It extracts structured data from websites and targets developers and researchers who need to gather information programmatically. Its API lets you declare the data elements to scrape, and an LLM interprets the page and extracts them.

How It Works

Parsera uses an LLM to parse HTML content and extract specified elements. You define a dictionary that maps each desired data field to a natural language description. The library sends the web page content together with these descriptions to an LLM, which identifies the relevant information and returns it as structured JSON. Because no hand-written CSS selectors or XPath queries are involved, scraping tends to adapt more gracefully to changes in website structure.
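The flow described above can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not Parsera's implementation: the prompt wording, the helper names `build_prompt` and `extract`, and the canned `fake_llm` reply are all assumptions made so the example runs without a model or network access.

```python
import json
from typing import Callable, Dict, List


def build_prompt(page_content: str, elements: Dict[str, str]) -> str:
    """Combine field descriptions and page content into one prompt (hypothetical format)."""
    field_lines = "\n".join(
        f"- {name}: {description}" for name, description in elements.items()
    )
    return (
        "Extract the following fields from the page and reply with a JSON list of objects.\n"
        f"Fields:\n{field_lines}\n\n"
        f"Page content:\n{page_content}"
    )


def extract(
    page_content: str, elements: Dict[str, str], llm: Callable[[str], str]
) -> List[dict]:
    """Send the prompt to an LLM callable and keep only the requested fields."""
    raw_reply = llm(build_prompt(page_content, elements))
    rows = json.loads(raw_reply)
    # Drop any keys the model returned that were not requested.
    return [{name: row.get(name) for name in elements} for row in rows]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned JSON reply (assumption)."""
    return json.dumps([{"title": "Example post", "points": "42", "extra": "ignored"}])


# Field names mapped to natural language descriptions, as described in the text.
elements = {"title": "Headline of the post", "points": "Number of upvotes"}
html = "<html><body><a>Example post</a><span>42 points</span></body></html>"

result = extract(html, elements, fake_llm)
print(result)
```

In practice the `llm` callable would wrap a real model client, and the reply would need validation, since a model is not guaranteed to return well-formed JSON.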

Quick Start & Requirements

Highlighted Details

  • Simple, declarative interface for defining scrape targets.
  • Supports both synchronous (run) and asynchronous (arun) execution.
  • CLI and Docker options available for integration and deployment.
  • Can be configured to use custom LLMs.
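A common way to offer both a synchronous and an asynchronous entry point is to implement the logic once as a coroutine and have the blocking method drive it with `asyncio.run`. The sketch below shows that pattern. Only the method names `run` and `arun` come from the summary above. The class `SketchScraper`, its parameters, and its placeholder return values are hypothetical and are not Parsera's actual API.

```python
import asyncio
from typing import Dict


class SketchScraper:
    """Hypothetical stand-in used to illustrate the run/arun pairing."""

    async def arun(self, url: str, elements: Dict[str, str]) -> Dict[str, str]:
        # Placeholder for awaiting a page fetch and an LLM call.
        await asyncio.sleep(0)
        return {name: f"<{name} from {url}>" for name in elements}

    def run(self, url: str, elements: Dict[str, str]) -> Dict[str, str]:
        # Blocking wrapper: starts an event loop and waits for arun to finish.
        return asyncio.run(self.arun(url, elements))


scraper = SketchScraper()
elements = {"title": "Headline of the post"}

# Synchronous call from ordinary code.
sync_result = scraper.run("https://example.com", elements)

# Asynchronous call; inside an existing coroutine this would be `await scraper.arun(...)`.
async_result = asyncio.run(scraper.arun("https://example.com", elements))

print(sync_result == async_result)
```

The blocking wrapper cannot be called from code that is already running inside an event loop, which is one reason a library would expose the coroutine directly as well.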

Maintenance & Community

The project appears to be maintained by a single author, @raznem. The README provides no explicit links to community channels or a roadmap.

Licensing & Compatibility

The README does not explicitly state a license. This is a significant omission for evaluating commercial use or integration with closed-source projects.

Limitations & Caveats

Because the library relies on LLMs for parsing, results may vary and API calls may incur costs. The absence of a specified license is a major blocker for commercial adoption.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 11 stars in the last 30 days
