entities-extraction-web-scraper by trancethehuman

Web scraper using OpenAI Functions for selective data extraction

Created 2 years ago

308 stars

Top 87.3% on SourcePulse

Project Summary

This project provides a Python-based web scraper that uses OpenAI Functions and LangChain for selective data extraction. It is aimed at developers and researchers who need to extract structured information from websites programmatically: they define a data schema and name the HTML tags to read, and the scraper returns only the matching fields.

How It Works

The scraper uses Playwright for browser automation and LangChain's integration with OpenAI Functions. Users define a Pydantic schema or a dictionary specifying the desired data points. The HTML content of a given URL is first narrowed to specified tags such as <p>, <span>, or <h1>. The schema is then passed to OpenAI as a function definition, and the model responds with a function call whose arguments contain the extracted values.
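The two local steps in that flow, filtering the page down to chosen tags and turning a schema into a function definition, can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: `TagTextExtractor`, `extract_text`, `build_function_spec`, the function name `information_extraction`, and the sample HTML are all hypothetical, and the real project relies on Playwright and LangChain rather than `html.parser`. No request is sent to OpenAI here.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, tags):
    """Return the text found inside the given tags, joined by single spaces."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_function_spec(schema):
    """Wrap a {field: json_type} dict in the name/description/parameters shape
    that OpenAI function calling expects. The name and description are assumptions."""
    return {
        "name": "information_extraction",
        "description": "Extract the listed fields from the passage.",
        "parameters": {
            "type": "object",
            "properties": {field: {"type": json_type} for field, json_type in schema.items()},
            "required": list(schema),
        },
    }


# Sample input, invented for illustration.
html = "<h1>Acme Widget</h1><div>nav junk</div><p>Price: <span>$19</span></p>"
schema = {"product_name": "string", "price": "string"}

passage = extract_text(html, ["h1", "p", "span"])
spec = build_function_spec(schema)
print(passage)  # Acme Widget Price: $19
```

Dropping everything outside the chosen tags before calling the model keeps the prompt small, which is presumably why the scraper asks for a tag list in the first place.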

Quick Start & Requirements

Install dependencies: poetry install --sync
Install Playwright browsers: playwright install
Set OPENAI_API_KEY in a .env file.
Run: python main.py
Requires Python 3.7+ and an OpenAI API key.
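Before running the scraper, it can help to confirm that the key from the .env file is actually visible to Python. The sketch below is an assumption-laden illustration: `parse_env_lines`, `load_env_file`, and `require_api_key` are hypothetical helpers written with the standard library, and the project itself may load .env differently (for example through a dotenv package).

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env", environ=None):
    """Copy values from a .env file into environ without overwriting existing entries."""
    environ = os.environ if environ is None else environ
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for key, value in parse_env_lines(fh).items():
                environ.setdefault(key, value)
    return environ


def require_api_key(environ):
    """Return OPENAI_API_KEY or fail with a readable message."""
    key = environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

A typical check would be `require_api_key(load_env_file())` at the top of a script, so a missing key fails immediately instead of partway through a scrape.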

Highlighted Details

Uses Playwright for robust browser automation.
Leverages OpenAI Functions for intelligent data extraction based on defined schemas.
Integrates with LangChain, with functionality contributed via a Pull Request.
Includes an optional FastAPI server for API deployment.

Maintenance & Community

The project is maintained by trancethehuman. Further community engagement details are not specified in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README.

Limitations & Caveats

The README advises caution regarding scraping practices and potential legal implications. The project is presented as a contribution to LangChain, suggesting it might be experimental or subject to changes within the LangChain ecosystem.
