extractor by lightfeed

LLM and AI browser automation for robust web data extraction

Created 1 year ago

318 stars

Top 84.8% on SourcePulse

2 Experts Love This Project

Travis Fischer

Founder of Agentic

Jonathan Ragan-Kelley

Professor at MIT

Project Summary

This TypeScript library extracts structured data from web pages using Large Language Models (LLMs) and browser automation. It targets developers building data pipelines, competitor intelligence tools, or any application that needs reliable web scraping. It offers natural-language prompting for navigation and extraction, and is designed for token efficiency and extraction accuracy.

How It Works

The extractor leverages Playwright for browser automation, supporting local, serverless, and remote browser instances with built-in stealth capabilities to avoid detection. It converts HTML content into an LLM-friendly Markdown format, optionally focusing on main content and cleaning URLs. Data extraction is driven by LLMs (integrated via LangChain providers like OpenAI, Gemini, Anthropic, Ollama) using Zod schemas to define the desired output structure. Novel features include JSON recovery for sanitizing LLM outputs and AI-powered browser navigation through the companion @lightfeed/browser-agent package.
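
The idea behind JSON recovery can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the `Product` shape and the `isProduct` and `recoverProducts` helpers are hypothetical names invented for this example, and a hand-written type guard stands in for a Zod schema. The sketch parses the model's output and keeps only the array items that validate, rather than rejecting the whole response because one item is malformed.

```typescript
// Hypothetical item shape, used only for this illustration.
interface Product {
  name: string;
  price: number | null;
}

// Runtime check standing in for a schema validator such as Zod.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.name === "string" &&
    (typeof record.price === "number" || record.price === null)
  );
}

// Parse LLM output and keep only the array items that pass validation.
function recoverProducts(llmOutput: string): Product[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    return []; // Not JSON at all: nothing to recover.
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isProduct);
}
```

Dropping invalid items instead of failing outright trades completeness for resilience, which tends to suit scraping pipelines where a partial result is more useful than none.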

Quick Start & Requirements

Installation: npm install @lightfeed/extractor
LLM Provider Installation: Install specific LangChain integrations (e.g., @langchain/openai, @langchain/google-genai).
Prerequisites: Node.js, npm/yarn, TypeScript. API keys are required for cloud LLM providers. Playwright browser binaries are managed by the library or require explicit configuration for serverless/remote setups.
Documentation: Examples provided in the README; further details available via @lightfeed/browser-agent documentation.

Highlighted Details

LLM-ready Markdown conversion with options for main content extraction, URL cleaning, and image inclusion.
Structured data extraction using LLMs with Zod schemas, supporting JSON mode.
JSON Recovery utility (safeSanitizedParser) to sanitize and recover partial or malformed LLM outputs.
Robust URL validation and handling, including repair of markdown-escaped links and resolution of relative URLs.
Stealth Mode Browser Automation with anti-bot patches and proxy configuration.
AI Browser Navigation capabilities when paired with @lightfeed/browser-agent.
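
The URL handling item above (repairing markdown-escaped links and resolving relative URLs) might look roughly like the following sketch. It relies only on the standard `URL` class. The `repairUrl` helper and its escape list are assumptions made for illustration and are not the library's API.

```typescript
// Hypothetical helper: undo Markdown escaping, resolve against the page URL,
// and accept only http(s) results. Returns null when the link is unusable.
function repairUrl(rawLink: string, pageUrl: string): string | null {
  // Markdown converters may emit "\_" or "\&" inside links; drop the backslash.
  const unescaped = rawLink.trim().replace(/\\([_*()\[\]&~#-])/g, "$1");
  try {
    // The second argument resolves relative links against the page URL.
    const resolved = new URL(unescaped, pageUrl);
    const isWeb = resolved.protocol === "http:" || resolved.protocol === "https:";
    return isWeb ? resolved.href : null;
  } catch {
    return null; // Neither a valid absolute URL nor resolvable against the base.
  }
}
```

For example, `repairUrl("/items/a\\_b", "https://example.com/list")` resolves to an absolute link on `example.com` with the escape removed, while a `javascript:` link yields `null`.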

Maintenance & Community

Support: Direct assistance via email at support@lightfeed.ai or by opening issues on the GitHub repository.
Community: No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: The Apache 2.0 license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

For OpenAI models, optional schema fields are not directly supported and require using .nullable() instead of .optional(). The URL cleaning feature is currently specialized for Amazon product URLs. Serverless browser deployments require specific setup with @sparticuz/chromium.
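
To make the `.nullable()` versus `.optional()` distinction concrete, the sketch below shows the two data shapes in plain TypeScript. In Zod, `z.number().optional()` describes a key that may be absent, while `z.number().nullable()` describes a key that is always present but may hold `null`. The type names and the `toNullable` helper are hypothetical and exist only for illustration.

```typescript
// Shape implied by an optional field: the key may be missing entirely.
type ProductWithOptional = { name: string; price?: number };

// Shape implied by a nullable field: the key is always present, possibly null.
type ProductWithNullable = { name: string; price: number | null };

// Hypothetical helper: convert the optional convention to the nullable one.
function toNullable(product: ProductWithOptional): ProductWithNullable {
  return { name: product.name, price: product.price ?? null };
}
```

Downstream code that previously checked for a missing key would check for `null` instead once the schema uses the nullable convention.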

Health Check

Last Commit: 1 month ago

Responsiveness: Inactive

Star History: 1 star in the last 30 days