lightfeed
LLM and AI browser automation for robust web data extraction
Top 87.5% on SourcePulse
This TypeScript library provides a robust solution for extracting structured data from web pages using Large Language Models (LLMs) and browser automation. It targets developers building data pipelines, competitor intelligence tools, or any application requiring reliable web scraping, offering natural language prompting for navigation and extraction, enhanced token efficiency, and improved accuracy.
How It Works
The extractor leverages Playwright for browser automation, supporting local, serverless, and remote browser instances with built-in stealth capabilities to avoid detection. It converts HTML content into an LLM-friendly Markdown format, optionally focusing on main content and cleaning URLs. Data extraction is driven by LLMs (integrated via LangChain providers like OpenAI, Gemini, Anthropic, Ollama) using Zod schemas to define the desired output structure. Novel features include JSON recovery for sanitizing LLM outputs and AI-powered browser navigation through the companion @lightfeed/browser-agent package.
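The JSON recovery idea mentioned above can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation and does not use its API: `recoverItems`, `isProduct`, and the `Product` shape are hypothetical names chosen for this example. The sketch assumes the LLM returned a JSON array in which some items are malformed, and keeps the valid items rather than rejecting the whole response.

```typescript
// Hypothetical item shape, used only for this illustration.
interface Product {
  name: string;
  price: number | null;
}

// Type guard: accepts only objects with a string name and a numeric-or-null price.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    (typeof v.price === "number" || v.price === null)
  );
}

// Keep the items that validate and drop the rest,
// instead of failing on the first malformed entry.
function recoverItems<T>(
  raw: unknown,
  isValid: (item: unknown) => item is T
): T[] {
  if (!Array.isArray(raw)) return [];
  return raw.filter(isValid);
}

// Simulated LLM output: the second item has a string where a number is expected.
const llmOutput: unknown = JSON.parse(
  '[{"name":"Desk lamp","price":24.99},' +
    '{"name":"Chair","price":"n/a"},' +
    '{"name":"Rug","price":null}]'
);

const recovered = recoverItems(llmOutput, isProduct);
console.log(recovered.map((p) => p.name)); // [ 'Desk lamp', 'Rug' ]
```

A real pipeline would more likely validate against the Zod schema itself; the hand-written guard here only keeps the sketch self-contained.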
Quick Start & Requirements
Install: npm install @lightfeed/extractor
LLM provider package (e.g., @langchain/openai, @langchain/google-genai).
AI browser navigation: see the @lightfeed/browser-agent documentation.
Highlighted Details
JSON recovery (safeSanitizedParser) to sanitize and recover partial or malformed LLM outputs.
AI-powered browser navigation through @lightfeed/browser-agent.
Maintenance & Community
Support is available at support@lightfeed.ai or by opening issues on the GitHub repository.
Licensing & Compatibility
Limitations & Caveats
For OpenAI models, optional schema fields are not directly supported and require using .nullable() instead of .optional(). The URL cleaning feature is currently specialized for Amazon product URLs. Serverless browser deployments require specific setup with @sparticuz/chromium.
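The `.nullable()` caveat can be shown with a short Zod sketch. The schema and its field names (`name`, `discount`) are hypothetical and chosen only for illustration; the example assumes the `zod` package is installed. A nullable field stays a required key that may hold `null`, whereas an optional field may be omitted entirely.

```typescript
import { z } from "zod";

// Hypothetical schema; the field names are assumptions for this example.
const productSchema = z.object({
  name: z.string(),
  // .nullable() keeps the key required but allows null as its value,
  // which is the form to use in place of .optional() for OpenAI models.
  discount: z.number().nullable(),
});

// The key is present with an explicit null: this validates.
const withNull = productSchema.safeParse({ name: "Desk lamp", discount: null });

// The key is missing entirely: a nullable (non-optional) field rejects this.
const missingKey = productSchema.safeParse({ name: "Desk lamp" });

console.log(withNull.success); // true
console.log(missingKey.success); // false
```

Downstream code can then treat `null` as "not found on the page" without having to distinguish it from an absent key.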