TypeScript library for structured data extraction from webpages using LLMs
Top 8.9% on sourcepulse
This TypeScript library extracts structured data from any webpage using Large Language Models (LLMs). It targets developers who need to automate web data extraction, and offers a flexible, type-safe approach that uses LLM function calling to map page content onto a schema.
How It Works
The library uses LLM function calling to convert unstructured webpage content into predefined schemas. It supports several LLM backends (hosted providers such as OpenAI and Groq, and local models via Ollama or GGUF files) and integrates with Playwright for browser automation. Page content can be passed to the model in different formats: raw HTML, Markdown, text extracted via Readability.js, or screenshots for multi-modal models. A code-generation mode creates reusable Playwright scripts from a defined schema.
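The function-calling pattern described above can be sketched without any dependencies. This is only an illustration of the idea, not llm-scraper's API: every name in it (Schema, ToolDefinition, toToolDefinition, validate, extract, fakeModel) is hypothetical, the schema language is a two-type toy rather than Zod, and the "model" is a regex-based stand-in so the example runs offline.

```typescript
// Minimal sketch of schema-driven extraction via LLM function calling.
// All names below are hypothetical and are NOT part of llm-scraper.

type FieldType = 'string' | 'number'
type Schema = Record<string, FieldType>

interface ToolDefinition {
  name: string
  description: string
  parameters: {
    type: 'object'
    properties: Record<string, { type: FieldType }>
    required: string[]
  }
}

// A model takes page content plus a tool definition and returns the raw
// arguments of the tool call it decided to make. Real provider calls are
// asynchronous; this sketch is synchronous for brevity.
type Model = (content: string, tool: ToolDefinition) => unknown

// Turn the schema into a JSON-Schema-style tool definition that a
// function-calling model can be asked to "call".
function toToolDefinition(schema: Schema): ToolDefinition {
  const properties: Record<string, { type: FieldType }> = {}
  for (const [key, type] of Object.entries(schema)) {
    properties[key] = { type }
  }
  return {
    name: 'extract',
    description: 'Return the requested fields from the page content',
    parameters: { type: 'object', properties, required: Object.keys(schema) },
  }
}

// Model output is untrusted, so check every field against the schema.
function validate(schema: Schema, args: unknown): Record<string, string | number> {
  if (typeof args !== 'object' || args === null) {
    throw new Error('tool call arguments must be an object')
  }
  const record = args as Record<string, unknown>
  const result: Record<string, string | number> = {}
  for (const [key, type] of Object.entries(schema)) {
    const value = record[key]
    if (typeof value !== type) {
      throw new Error(`field "${key}" should be a ${type}`)
    }
    result[key] = value as string | number
  }
  return result
}

// Build the tool definition, let the model fill it in, validate the result.
function extract(model: Model, schema: Schema, content: string) {
  const tool = toToolDefinition(schema)
  return validate(schema, model(content, tool))
}

// Stand-in for a real LLM: pulls values out with regular expressions.
const fakeModel: Model = (content) => {
  const title = /<h1>(.*?)<\/h1>/.exec(content)?.[1]
  const points = /(\d+) points/.exec(content)?.[1]
  return { title, points: points === undefined ? undefined : Number(points) }
}

const storySchema: Schema = { title: 'string', points: 'number' }
const story = extract(
  fakeModel,
  storySchema,
  '<h1>Show HN: Demo</h1><span>42 points</span>',
)
console.log(story)
```

The validation step is the part worth keeping in any real setup: a function-calling model is only asked to follow the schema, so arguments that come back with a missing or mistyped field should be rejected rather than passed downstream.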
Quick Start & Requirements
Install: npm i zod playwright llm-scraper
Also requires installing an LLM provider package (e.g. @ai-sdk/openai, ollama-ai-provider) and configuring API keys or local model paths.
Highlighted Details
Maintenance & Community
This is an open-source project welcoming community contributions via issues and pull requests.
Licensing & Compatibility
The library's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would depend on the final license.
Limitations & Caveats
The README does not state a license, which needs to be clarified before commercial adoption. The code-generation feature is described as new, so early-stage issues and rapid changes are possible.