llm-scraper by mishushakov

TypeScript library for structured data extraction from webpages using LLMs

Created 1 year ago

6,151 stars

Top 8.3% on SourcePulse

Project Summary

This TypeScript library extracts structured data from webpages using Large Language Models (LLMs). It targets developers who need to automate web data extraction, and it offers a flexible, type-safe approach that uses LLM function calling to convert page content into a schema.

How It Works

The library uses LLM function calling to convert unstructured webpage content into a predefined schema. It supports several LLM backends (OpenAI, Groq, Ollama, and local GGUF models) and integrates with Playwright for browser automation. Page content can be passed to the model in one of several formats: raw HTML, Markdown, text extracted with Readability.js, or a screenshot for multi-modal models. A code-generation mode produces reusable Playwright scripts from a defined schema.
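The function-calling step described above can be illustrated with a minimal, dependency-free sketch. It is not llm-scraper's implementation: `Schema`, `toToolDefinition`, `parseToolArguments`, and `fakeModel` are hypothetical names, the tiny field-type schema stands in for a Zod schema, and the model is replaced by a stub so the example runs offline. The idea it shows is that the schema becomes a tool definition, the model answers with JSON arguments for that tool, and those arguments are validated against the schema before use.

```typescript
// Hypothetical sketch: these names are illustrative and are not llm-scraper's API.

// A tiny field-type schema standing in for a Zod schema.
type FieldType = "string" | "number";
type Schema = Record<string, FieldType>;

// JSON-Schema-style tool (function) definition sent to the model.
interface ToolDefinition {
  name: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: FieldType }>;
    required: string[];
  };
}

// Convert the schema into a tool definition the model can "call".
function toToolDefinition(name: string, schema: Schema): ToolDefinition {
  const properties: Record<string, { type: FieldType }> = {};
  for (const key of Object.keys(schema)) {
    properties[key] = { type: schema[key] };
  }
  return {
    name,
    parameters: { type: "object", properties, required: Object.keys(schema) },
  };
}

// Validate the JSON arguments returned for the tool call against the schema.
function parseToolArguments(
  schema: Schema,
  rawArguments: string
): Record<string, string | number> {
  const parsed: unknown = JSON.parse(rawArguments);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("tool arguments must be a JSON object");
  }
  const input = parsed as Record<string, unknown>;
  const result: Record<string, string | number> = {};
  for (const key of Object.keys(schema)) {
    const value = input[key];
    if (typeof value !== schema[key]) {
      throw new Error(`field "${key}" should be a ${schema[key]}`);
    }
    result[key] = value as string | number;
  }
  return result;
}

// Stand-in for the LLM (assumption): a real model would read the page and
// return tool-call arguments; this stub uses two regexes instead.
function fakeModel(pageContent: string, tool: ToolDefinition): string {
  const title = pageContent.match(/<h1>(.*?)<\/h1>/);
  const points = pageContent.match(/(\d+) points/);
  const candidates: Record<string, string | number> = {
    title: title ? title[1] : "",
    points: points ? Number(points[1]) : 0,
  };
  // Only answer with the fields the tool definition asks for.
  const args: Record<string, string | number> = {};
  for (const key of tool.parameters.required) {
    if (key in candidates) args[key] = candidates[key];
  }
  return JSON.stringify(args);
}

const storySchema: Schema = { title: "string", points: "number" };
const tool = toToolDefinition("extract_story", storySchema);
const html = "<h1>Example story</h1><span>128 points</span>";
const story = parseToolArguments(storySchema, fakeModel(html, tool));
console.log(story);
```

Validating the returned arguments matters because a model can return fields of the wrong type; rejecting those early is what makes the extracted data safe to treat as typed.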

Quick Start & Requirements

Install: npm i zod playwright llm-scraper
LLM Provider Setup: Requires installing specific provider packages (e.g., @ai-sdk/openai, ollama-ai-provider) and configuring API keys or local model paths.
Dependencies: Node.js, Playwright, Zod, and an LLM provider.
Setup: Install the packages, then initialize the LLM provider and the scraper.
Docs: https://github.com/mishushakov/llm-scraper

Highlighted Details

Supports local LLM providers like Ollama and GGUF models.
Full type-safety with TypeScript and Zod schemas.
Offers streaming output for partial results.
Includes a code-generation feature to create reusable scraping scripts.
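To make the streaming item above concrete, here is a small sketch of what consuming partial results can look like. It does not call llm-scraper: `StoryDraft`, `fakePartialStream`, and `isComplete` are hypothetical names, and a synchronous generator stands in for what would be an asynchronous stream in practice (consumed with `for await`). Each yielded value is a progressively more complete snapshot of the target object, so a caller can render fields as they arrive and check for completeness at the end.

```typescript
// Hypothetical sketch: illustrative names only, not llm-scraper's API.

interface StoryDraft {
  title: string;
  points: number;
  by: string;
}

// Yields a growing snapshot each time a new field "arrives".
// A real stream would be asynchronous; a sync generator keeps the sketch simple.
function* fakePartialStream(
  deltas: Partial<StoryDraft>[]
): Generator<Partial<StoryDraft>> {
  let snapshot: Partial<StoryDraft> = {};
  for (const delta of deltas) {
    snapshot = { ...snapshot, ...delta };
    yield snapshot;
  }
}

// Type guard: true once every field of StoryDraft is present.
function isComplete(value: Partial<StoryDraft>): value is StoryDraft {
  return (
    typeof value.title === "string" &&
    typeof value.points === "number" &&
    typeof value.by === "string"
  );
}

const snapshots: Partial<StoryDraft>[] = [];
const incoming = [{ title: "Example story" }, { points: 128 }, { by: "alice" }];
for (const partial of fakePartialStream(incoming)) {
  snapshots.push(partial);
  console.log(isComplete(partial) ? "complete:" : "partial:", partial);
}
```

Typing the intermediate values as `Partial<StoryDraft>` keeps the compiler honest: fields may be missing until the final snapshot passes the `isComplete` check.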

Maintenance & Community

This is an open-source project welcoming community contributions via issues and pull requests.

Licensing & Compatibility

The README does not explicitly state a license. Whether the library can be used commercially or linked from closed-source code depends on the license the project adopts.

Limitations & Caveats

Without a stated license, suitability for commercial adoption cannot be assessed. The code-generation feature is described as new, so it may have early-stage issues or change quickly.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 day

Star History

30 stars in the last 30 days