llm-scraper  by mishushakov

TypeScript library for structured data extraction from webpages using LLMs

created 1 year ago
5,902 stars

Top 8.9% on sourcepulse

Project Summary

This TypeScript library extracts structured data from webpages using Large Language Models (LLMs). It targets developers who need to automate web data extraction, and it offers a type-safe approach that uses LLM function calling to convert page content into a predefined schema.

How It Works

The library uses LLM function calling to convert unstructured webpage content into predefined schemas. It supports several LLM providers (OpenAI, Groq, Ollama) as well as local GGUF models, and it integrates with Playwright for browser automation. Page content can be passed to the model in one of several formats: raw HTML, Markdown, text extracted via Readability.js, or screenshots for multi-modal models. A code-generation mode produces reusable Playwright scripts from a defined schema.
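To make the schema-conversion step concrete, here is a minimal, dependency-free sketch of the general idea: the model returns function-call arguments as a JSON string, and the caller checks them against a predefined schema before trusting them. Every name in it (`Schema`, `Infer`, `parseToolArguments`, `storySchema`, the simulated payload) is hypothetical and chosen for illustration. It is not llm-scraper's API, which relies on Zod schemas and a provider SDK rather than this hand-rolled validator.

```typescript
// Hypothetical, dependency-free sketch; not llm-scraper's actual API.
type FieldType = 'string' | 'number';
type Schema = Record<string, FieldType>;

// Maps a schema description to the TypeScript shape it describes.
type Infer<S extends Schema> = {
  [K in keyof S]: S[K] extends 'number' ? number : string;
};

// Parse the JSON arguments a model returns for a function call and verify
// that every schema field is present with the expected primitive type.
function parseToolArguments<S extends Schema>(schema: S, rawArguments: string): Infer<S> {
  const parsed: unknown = JSON.parse(rawArguments);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('Expected a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  const fields: Schema = schema;
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    const expected = fields[key];
    const value = record[key];
    if (typeof value !== expected) {
      throw new Error(`Field "${key}" should be a ${expected}, got ${typeof value}`);
    }
    result[key] = value; // fields outside the schema are dropped
  }
  return result as unknown as Infer<S>;
}

// Assumed example schema and a simulated function-call payload.
const storySchema = { title: 'string', points: 'number' } as const;
const simulatedArguments = '{"title":"Show HN: Example","points":128,"extra":"ignored"}';

// `story` is typed as { readonly title: string; readonly points: number }.
const story = parseToolArguments(storySchema, simulatedArguments);
```

Deriving the result type from the schema is what gives the "type-safe" property: a consumer that reads `story.points` gets a `number` at compile time, and a payload that violates the schema fails at runtime instead of propagating bad data.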

Quick Start & Requirements

  • Install: npm i zod playwright llm-scraper
  • LLM Provider Setup: Requires installing specific provider packages (e.g., @ai-sdk/openai, ollama-ai-provider) and configuring API keys or local model paths.
  • Dependencies: Node.js, Playwright, Zod, and an LLM provider.
  • Setup: Install the packages, then initialize the LLM and the scraper.
  • Docs: https://github.com/mishushakov/llm-scraper

Highlighted Details

  • Supports local models through Ollama and GGUF.
  • Full type-safety with TypeScript and Zod schemas.
  • Offers streaming output for partial results.
  • Includes a code-generation feature to create reusable scraping scripts.
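The streaming item above can be pictured with a small sketch of how partial results accumulate. This is an illustration under assumptions, not the library's implementation: `Story`, `simulatedFieldStream`, and `accumulatePartials` are hypothetical names, and a synchronous generator stands in for what would be an asynchronous stream in practice.

```typescript
// Hypothetical sketch of streaming partial results; not llm-scraper's API.
type Story = { title: string; points: number };

// Simulates a model emitting one field at a time.
// A real stream would be asynchronous; a sync generator keeps the sketch simple.
function* simulatedFieldStream(): Generator<Partial<Story>> {
  yield { title: 'Show HN: Example' };
  yield { points: 128 };
}

// Merge each incoming fragment into an accumulator and yield a snapshot,
// so a consumer can render incomplete data before the stream ends.
function* accumulatePartials<T extends object>(
  chunks: Iterable<Partial<T>>,
): Generator<Partial<T>> {
  let current: Partial<T> = {};
  for (const chunk of chunks) {
    current = { ...current, ...chunk };
    yield current;
  }
}

// Collect every intermediate snapshot for inspection.
const snapshots = Array.from(accumulatePartials<Story>(simulatedFieldStream()));
```

Each snapshot is typed as `Partial<Story>`, so consumers have to handle missing fields until the final snapshot arrives.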

Maintenance & Community

This is an open-source project welcoming community contributions via issues and pull requests.

Licensing & Compatibility

The library's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the final license.

Limitations & Caveats

The missing license is the main obstacle to commercial adoption. The code-generation feature is described as new, so it may have early-stage issues or change quickly.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 2

Star History

1,132 stars in the last 90 days

Explore Similar Projects

Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Travis Fischer (Founder of Agentic), and 1 more.

instructor-js by 567-labs

0%
738
TypeScript tool for structured extraction from LLMs
created 1 year ago
updated 6 months ago