entities-extraction-web-scraper by trancethehuman

Web scraper using OpenAI Functions for selective data extraction

Created 2 years ago
307 stars

Top 87.3% on SourcePulse

Project Summary

This project is a Python web scraper that uses OpenAI Functions and LangChain for selective data extraction. It is aimed at developers and researchers who need to extract structured information from websites programmatically: the user defines a data schema, names the HTML tags to read, and gets back the matching fields.

How It Works

The scraper uses Playwright for browser automation and LangChain's integration with OpenAI Functions. Users define a Pydantic schema or a dictionary that lists the desired data points. The scraper loads a given URL, keeps only the content of the specified tags (for example <p>, <span>, or <h1>), and sends that text to OpenAI together with the schema. The model responds with a function call whose arguments hold the extracted values.
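The two inputs described above, the tag filter and the dictionary schema, can be sketched with the standard library alone. This is a minimal illustration, not the project's code: it makes no Playwright, LangChain, or OpenAI calls, and the names `TagTextExtractor`, `extract_tag_text`, `build_function_definition`, `news_schema`, and the function name `information_extraction` are all hypothetical. It assumes well-formed HTML in which every wanted tag is explicitly closed.

```python
import json
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags_to_extract = set(tags_to_extract)
        self.depth = 0  # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_extract:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags_to_extract and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_tag_text(html, tags_to_extract):
    """Return the text found inside the wanted tags, joined by single spaces."""
    parser = TagTextExtractor(tags_to_extract)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_function_definition(schema):
    """Wrap a dictionary schema in the JSON shape used for OpenAI function calling."""
    return {
        "name": "information_extraction",  # assumed name, for illustration only
        "description": "Extract the listed fields from the passage.",
        "parameters": {
            "type": "object",
            "properties": schema["properties"],
            "required": schema["required"],
        },
    }


# Hypothetical dictionary schema: the data points to pull from a news page.
news_schema = {
    "properties": {
        "headline": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["headline", "summary"],
}

SAMPLE_HTML = (
    "<html><body><h1>Rates hold steady</h1><script>track()</script>"
    "<p>The central bank left rates unchanged.</p>"
    "<div>Subscribe now</div></body></html>"
)

if __name__ == "__main__":
    # Only <h1> and <p> text survives; the script and the <div> are dropped.
    print(extract_tag_text(SAMPLE_HTML, ["h1", "p"]))
    print(json.dumps(build_function_definition(news_schema), indent=2))
```

Filtering by tag before the model call keeps the prompt short and removes scripts and navigation text that would otherwise consume tokens. In the real pipeline the filtered text and the function definition would be passed to the model, which fills in `headline` and `summary` as function-call arguments.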

Quick Start & Requirements

  • Install dependencies: poetry install --sync
  • Install Playwright browsers: playwright install
  • Set OPENAI_API_KEY in a .env file.
  • Run: python main.py
  • Requires Python 3.7+ and an OpenAI API key.
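Before running the scraper it can help to confirm that the .env file actually defines the key. The sketch below is a standard-library stand-in for that check; how the project itself loads the file is not stated here, and `parse_env_text` and `has_api_key` are hypothetical names. It handles only simple KEY=VALUE lines, not the full syntax that dedicated dotenv loaders accept.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_api_key(env_values):
    """True when OPENAI_API_KEY is present and non-empty."""
    return bool(env_values.get("OPENAI_API_KEY"))


if __name__ == "__main__":
    # Placeholder content; a real .env file would hold an actual key.
    sample = '# local secrets\nOPENAI_API_KEY="sk-placeholder"\n'
    print(has_api_key(parse_env_text(sample)))
```

To check a real file, read it with `open(".env", encoding="utf-8").read()` and pass the result to `parse_env_text`.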

Highlighted Details

  • Uses Playwright for robust browser automation.
  • Leverages OpenAI Functions for intelligent data extraction based on defined schemas.
  • Integrates with LangChain; the README presents the functionality as a pull-request contribution to LangChain.
  • Includes an optional FastAPI server for API deployment.

Maintenance & Community

The project is maintained by trancethehuman. Further community engagement details are not specified in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README.

Limitations & Caveats

The README advises caution regarding scraping practices and potential legal implications. The project is presented as a contribution to LangChain, suggesting it might be experimental or subject to changes within the LangChain ecosystem.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
