entities-extraction-web-scraper by trancethehuman

Web scraper using OpenAI Functions for selective data extraction

created 2 years ago
306 stars

Top 88.6% on sourcepulse

Project Summary

This project provides a Python-based web scraper that leverages OpenAI Functions and LangChain for selective data extraction. It's designed for developers and researchers who need to programmatically extract structured information from websites, simplifying the process of defining data schemas and targeting specific HTML tags.

How It Works

The scraper uses Playwright for browser automation and LangChain's integration with OpenAI Functions. Users define a Pydantic schema or a dictionary that lists the desired data points. Playwright loads the given URL, and the HTML is reduced to the content of specified tags such as <p>, <span>, or <h1>. The schema is then passed to the OpenAI model as a function definition, and the model returns a function call whose arguments hold the extracted fields.
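The two local steps in this flow, keeping only the text inside chosen tags and wrapping a schema as a function definition, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's code: it does not launch Playwright or call OpenAI, and the names TagTextExtractor, extract_text, build_function_spec, and information_extraction are hypothetical.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []    # text fragments found inside wanted tags

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, wanted_tags):
    """Return the text inside wanted_tags, joined by single spaces."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_function_spec(schema):
    """Wrap a schema dict in the JSON-schema shape used for function calling.

    The function name below is a hypothetical placeholder.
    """
    return {
        "name": "information_extraction",
        "description": "Extract the listed fields from the passage.",
        "parameters": {
            "type": "object",
            "properties": schema["properties"],
            "required": schema.get("required", []),
        },
    }


if __name__ == "__main__":
    sample_html = (
        "<html><body><h1>Breaking News</h1><div>ad banner</div>"
        "<p>Markets rose <span>2%</span> today.</p></body></html>"
    )
    sample_schema = {
        "properties": {"headline": {"type": "string"}},
        "required": ["headline"],
    }
    print(extract_text(sample_html, ["h1", "p", "span"]))
    print(build_function_spec(sample_schema))
```

In a real run, the filtered text and the function definition would be sent to the model together; filtering first keeps the prompt small, which matters because whole pages can exceed the model's context window.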

Quick Start & Requirements

  • Install dependencies: poetry install --sync
  • Install Playwright browsers: playwright install
  • Set OPENAI_API_KEY in a .env file.
  • Run: python main.py
  • Requires Python 3.7+ and an OpenAI API key.
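Before running main.py, the two requirements above (Python version and API key) can be checked with a small preflight script. This is a sketch only: parse_env_text and preflight are hypothetical helpers, and the simple KEY=VALUE parser stands in for whatever .env loader the project actually uses.

```python
import os
import sys


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(env_text, environ=None, min_version=(3, 7)):
    """Return a list of problems; an empty list means the checks passed."""
    environ = os.environ if environ is None else environ
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d+ is required" % min_version)
    env = parse_env_text(env_text)
    if not (env.get("OPENAI_API_KEY") or environ.get("OPENAI_API_KEY")):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    # Read .env from the current directory if it exists.
    env_text = ""
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            env_text = handle.read()
    for problem in preflight(env_text):
        print("preflight:", problem)
```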

Highlighted Details

  • Uses Playwright for robust browser automation.
  • Leverages OpenAI Functions for intelligent data extraction based on defined schemas.
  • Integrates with LangChain, with functionality contributed via a Pull Request.
  • Includes an optional FastAPI server for API deployment.

Maintenance & Community

The project is maintained by trancethehuman. Further community engagement details are not specified in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README.

Limitations & Caveats

The README advises caution regarding scraping practices and potential legal implications. The project is presented as a contribution to LangChain, suggesting it might be experimental or subject to changes within the LangChain ecosystem.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days
