extractor by lightfeed

LLM and AI browser automation for robust web data extraction

Created 11 months ago
307 stars

Top 87.5% on SourcePulse

Project Summary

This TypeScript library extracts structured data from web pages using Large Language Models (LLMs) and browser automation. It targets developers building data pipelines, competitor intelligence tools, or other applications that need reliable web scraping. It offers natural language prompting for navigation and extraction, along with features aimed at token efficiency and extraction accuracy.

How It Works

The extractor uses Playwright for browser automation, supporting local, serverless, and remote browser instances with built-in stealth capabilities to avoid detection. It converts HTML content into an LLM-friendly Markdown format, optionally focusing on main content and cleaning URLs. Data extraction is driven by LLMs (integrated via LangChain providers such as OpenAI, Gemini, Anthropic, and Ollama), with Zod schemas defining the desired output structure. Notable features include JSON recovery for sanitizing LLM outputs and AI-powered browser navigation through the companion @lightfeed/browser-agent package.
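
To make the JSON recovery idea concrete, here is a minimal sketch of how complete items might be salvaged from a truncated LLM response. It is not the library's implementation: `recoverObjects`, `stripCodeFence`, and `isObject` are hypothetical names introduced for illustration, and the sketch only handles a top-level array of objects.

```typescript
// Hypothetical helpers for illustration; NOT the library's safeSanitizedParser.
type JsonObject = Record<string, unknown>;

const FENCE = "`".repeat(3); // Markdown code-fence marker

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Remove a leading fence line (with optional language tag) and a trailing fence.
function stripCodeFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  return text.trim();
}

// Return every complete outermost object found in the text.
function recoverObjects(raw: string): JsonObject[] {
  const text = stripCodeFence(raw);
  try {
    const parsed: unknown = JSON.parse(text);
    if (Array.isArray(parsed)) return parsed.filter(isObject);
    return isObject(parsed) ? [parsed] : [];
  } catch {
    // Not valid JSON as a whole: fall through to the brace scan below.
  }

  const recovered: JsonObject[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Ignore braces inside string literals, honoring backslash escapes.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}") {
      if (depth > 0) depth--;
      if (depth === 0 && start >= 0) {
        try {
          const candidate: unknown = JSON.parse(text.slice(start, i + 1));
          if (isObject(candidate)) recovered.push(candidate);
        } catch {
          // Skip a malformed item and keep scanning.
        }
        start = -1;
      }
    }
  }
  return recovered;
}
```

Given a response cut off midway through its third array item, this sketch returns the first two items and drops the incomplete one. A response whose outermost object is itself truncated yields nothing, which is one reason a production parser would need to be more thorough.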

Quick Start & Requirements

  • Installation: npm install @lightfeed/extractor
  • LLM Provider Installation: Install specific LangChain integrations (e.g., @langchain/openai, @langchain/google-genai).
  • Prerequisites: Node.js, npm/yarn, TypeScript. API keys are required for cloud LLM providers. Playwright browser binaries are managed by the library or require explicit configuration for serverless/remote setups.
  • Documentation: Examples provided in the README; further details available via @lightfeed/browser-agent documentation.

Highlighted Details

  • LLM-ready Markdown conversion with options for main content extraction, URL cleaning, and image inclusion.
  • Structured data extraction using LLMs with Zod schemas, supporting JSON mode.
  • JSON Recovery utility (safeSanitizedParser) to sanitize and recover partial or malformed LLM outputs.
  • Robust URL validation and handling, including repair of markdown-escaped links and resolution of relative URLs.
  • Stealth Mode Browser Automation with anti-bot patches and proxy configuration.
  • AI Browser Navigation capabilities when paired with @lightfeed/browser-agent.
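
As an illustration of the URL handling bullet above, the following sketch unescapes Markdown-escaped characters in a link and resolves it against the page URL using the standard `URL` class. `repairUrl` is a hypothetical helper, not a function exported by the library, and the set of characters it unescapes is an assumption.

```typescript
// Hypothetical helper for illustration; not part of @lightfeed/extractor.
function repairUrl(raw: string, baseUrl: string): string | null {
  // Markdown converters often backslash-escape characters such as "_" or "(".
  // Unescape them first: for http(s) URLs the URL parser would otherwise
  // treat a stray backslash as a path separator.
  const unescaped = raw.trim().replace(/\\([_()\[\]*~#&-])/g, "$1");
  try {
    // Relative links are resolved against the page they were found on.
    const url = new URL(unescaped, baseUrl);
    // Keep only web links; drop schemes such as mailto: or javascript:.
    if (url.protocol !== "http:" && url.protocol !== "https:") return null;
    return url.href;
  } catch {
    // Unparseable link or base URL.
    return null;
  }
}
```

Returning `null` rather than throwing lets a caller drop a bad link while keeping the rest of the extracted record.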

Maintenance & Community

  • Support: Direct assistance via email at support@lightfeed.ai or by opening issues on the GitHub repository.
  • Community: No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The Apache 2.0 license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

For OpenAI models, optional schema fields are not directly supported: use .nullable() instead of .optional(). The URL cleaning feature is currently specialized for Amazon product URLs. Serverless browser deployments require specific setup with @sparticuz/chromium.
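
The difference between the two Zod modifiers can be shown with plain TypeScript types. In Zod, `.optional()` lets a key be absent, while `.nullable()` keeps the key present and allows an explicit `null`. The sketch below is dependency-free. The `Product` shapes and the `dropNulls` helper are hypothetical, shown only to illustrate how application code might convert nullable output back into optional form.

```typescript
// Shape produced by a schema field declared as `price: z.number().nullable()`:
// the key is always present, and a missing value is an explicit null.
interface ProductNullable {
  name: string;
  price: number | null;
}

// Shape produced by `price: z.number().optional()`: the key may be absent.
interface ProductOptional {
  name: string;
  price?: number;
}

// Hypothetical post-processing step: turn explicit nulls into absent keys.
function dropNulls(product: ProductNullable): ProductOptional {
  return product.price === null
    ? { name: product.name }
    : { name: product.name, price: product.price };
}
```

For example, `dropNulls({ name: "Widget", price: null })` returns an object with no `price` key at all.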

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 33
  • Issues (30d): 0
  • Star History: 248 stars in the last 30 days
