Web scraper using OpenAI Functions for selective data extraction
This project provides a Python-based web scraper that leverages OpenAI Functions and LangChain for selective data extraction. It's designed for developers and researchers who need to programmatically extract structured information from websites, simplifying the process of defining data schemas and targeting specific HTML tags.
How It Works
The scraper utilizes Playwright for browser automation and LangChain's integration with OpenAI Functions. Users define a Pydantic schema or dictionary specifying the desired data points. The system then uses this schema to prompt OpenAI, which generates a function call to extract the relevant information from the HTML content of a given URL, focusing on specified tags like <p>, <span>, or <h1>.
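The two ideas in this section, keeping only the text inside chosen tags and describing the wanted fields as a schema, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the project's code: the project itself uses Playwright and LangChain, whereas this example swaps in Python's html.parser, and the names ARTICLE_SCHEMA, TagTextExtractor, extract_tag_text, and build_function_spec are hypothetical. The dictionary returned by build_function_spec follows the general shape of an OpenAI function definition (name, description, JSON Schema parameters).

```python
from html.parser import HTMLParser

# Hypothetical schema: the field names are illustrative, not taken from the project.
ARTICLE_SCHEMA = {
    "properties": {
        "headline": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["headline"],
}


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []    # text fragments of the element being read
        self.results = []   # one entry per outermost wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the text.
                text = " ".join("".join(self.chunks).split())
                if text:
                    self.results.append(text)
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_tag_text(html, tags):
    """Return the text of every element whose tag is in `tags`."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return parser.results


def build_function_spec(schema, name="information_extraction"):
    """Wrap a schema dict in the general shape of an OpenAI function definition."""
    return {
        "name": name,
        "description": "Extract the fields described by the schema from the text.",
        "parameters": {"type": "object", **schema},
    }
```

Filtering the HTML down to a few tags before prompting keeps the text sent to the model short, which is presumably why the scraper lets users target tags such as <p>, <span>, or <h1>.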
Quick Start & Requirements
Install dependencies: poetry install --sync
Install Playwright browsers: playwright install
Set OPENAI_API_KEY in a .env file.
Run: python main.py
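Before running the scraper, it can help to confirm that the .env file actually provides OPENAI_API_KEY. The sketch below is a minimal, standard-library check, assuming the file uses plain KEY=VALUE lines; the helper names load_env_file and has_api_key are hypothetical, and the project may load the file differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_api_key(path=".env"):
    """True if OPENAI_API_KEY is set in the environment or in the .env file."""
    from_env = os.environ.get("OPENAI_API_KEY")
    from_file = load_env_file(path).get("OPENAI_API_KEY")
    return bool(from_env or from_file)
```

Calling has_api_key() from the project directory gives a quick yes/no answer before launching python main.py.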
Highlighted Details
Maintenance & Community
The project is maintained by trancethehuman. Further community engagement details are not specified in the README.
Licensing & Compatibility
The project's license is not explicitly stated in the README.
Limitations & Caveats
The README advises caution regarding scraping practices and potential legal implications. The project is presented as a contribution to LangChain, suggesting it might be experimental or subject to changes within the LangChain ecosystem.