gpt-crawler by BuilderIO

CLI tool for site crawling to generate custom GPT knowledge files

created 1 year ago
21,763 stars

Top 2.0% on sourcepulse

Project Summary

This project crawls a website and generates knowledge files that can be uploaded when creating a custom GPT or assistant. It targets developers and other users who want a custom GPT to draw on the content of a specific site.

How It Works

The crawler runs on Node.js. Starting from a specified URL, it extracts text from each page using a provided CSS selector and follows links that match a given pattern, subject to limits on the number of pages crawled and on output file size. The extracted text is compiled into a single JSON file suitable for upload to OpenAI's platform when creating a custom GPT.
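The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: it walks a hypothetical in-memory site map (`demoSite`) instead of fetching real pages and applying a CSS selector, and the names `crawl`, `globToRegExp`, `CrawlOptions`, and the glob syntax are all assumptions made for the example.

```typescript
// Hypothetical in-memory "site": URL -> page. A real crawler would load each
// URL in a browser and extract text with a CSS selector instead.
type Page = { title: string; text: string; links: string[] };
type CrawledPage = { title: string; url: string; text: string };
type CrawlOptions = { startUrl: string; match: string; maxPagesToCrawl: number };

// Convert a simple glob to a RegExp (assumed syntax): "**" matches anything,
// "*" matches any run of characters that does not contain "/".
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

// Breadth-first crawl: visit pages in discovery order, enqueue only links
// that match the pattern, and stop once the page limit is reached.
function crawl(site: Map<string, Page>, opts: CrawlOptions): CrawledPage[] {
  const matcher = globToRegExp(opts.match);
  const queue: string[] = [opts.startUrl];
  const seen = new Set<string>(queue);
  const results: CrawledPage[] = [];

  while (queue.length > 0 && results.length < opts.maxPagesToCrawl) {
    const url = queue.shift()!;
    const page = site.get(url);
    if (!page) continue; // unknown URL: nothing to extract
    results.push({ title: page.title, url, text: page.text });
    for (const link of page.links) {
      if (!seen.has(link) && matcher.test(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return results;
}

const demoSite = new Map<string, Page>([
  ["https://example.com/docs/", { title: "Docs", text: "Intro", links: [
    "https://example.com/docs/a", "https://example.com/blog/x", "https://example.com/docs/b"] }],
  ["https://example.com/docs/a", { title: "A", text: "Page A", links: [
    "https://example.com/docs/", "https://example.com/docs/c"] }],
  ["https://example.com/docs/b", { title: "B", text: "Page B", links: [] }],
  ["https://example.com/docs/c", { title: "C", text: "Page C", links: [] }],
  ["https://example.com/blog/x", { title: "Blog", text: "Off-topic", links: [] }],
]);

const knowledge = crawl(demoSite, {
  startUrl: "https://example.com/docs/",
  match: "https://example.com/docs/**",
  maxPagesToCrawl: 3,
});

// The compiled result is a single JSON document, one entry per crawled page.
console.log(JSON.stringify(knowledge, null, 2));
```

With these inputs the blog page is never queued because it does not match the pattern, and the crawl stops after three pages even though a fourth matching page exists.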

Quick Start & Requirements

Highlighted Details

  • Generates knowledge files for custom GPTs from website URLs.
  • Configurable via config.ts with options for URL, link matching, content selectors, page limits, and output filenames.
  • Supports excluding specific resource types and limiting file size/token count.
  • Offers Docker and API server alternatives for running the crawler.
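To make the configuration options above concrete, here is a hedged sketch of what such a config object might look like. Only `maxPagesToCrawl`, `maxFileSize`, and `maxTokens` are named on this page; the other field names (`url`, `match`, `selector`, `outputFileName`, `resourceExclusions`), the units, the example values, and the `validateConfig` helper are assumptions for illustration and should be checked against the project's own config.ts.

```typescript
// Locally defined shape for illustration; not imported from the project.
interface CrawlConfig {
  url: string;                   // where the crawl starts (assumed name)
  match: string | string[];      // link pattern(s) to follow (assumed name)
  selector: string;              // CSS selector for the content to extract (assumed name)
  maxPagesToCrawl: number;       // upper bound on pages visited
  outputFileName: string;        // where the JSON is written (assumed name)
  resourceExclusions?: string[]; // resource types to skip, e.g. images (assumed name)
  maxFileSize?: number;          // output size limit, assumed here to be in MB
  maxTokens?: number;            // output token limit
}

const exampleConfig: CrawlConfig = {
  url: "https://example.com/docs/",
  match: "https://example.com/docs/**",
  selector: "main article",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  resourceExclusions: ["png", "jpg", "css"],
  maxFileSize: 1,
  maxTokens: 2_000_000,
};

// Hypothetical sanity check: returns a list of problems, empty if none.
function validateConfig(c: CrawlConfig): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//.test(c.url)) problems.push("url must start with http:// or https://");
  if (c.selector.trim() === "") problems.push("selector must not be empty");
  if (!Number.isInteger(c.maxPagesToCrawl) || c.maxPagesToCrawl < 1)
    problems.push("maxPagesToCrawl must be a positive integer");
  if (!c.outputFileName.endsWith(".json")) problems.push("outputFileName should end in .json");
  if (c.maxFileSize !== undefined && c.maxFileSize <= 0) problems.push("maxFileSize must be positive");
  if (c.maxTokens !== undefined && c.maxTokens <= 0) problems.push("maxTokens must be positive");
  return problems;
}

console.log(validateConfig(exampleConfig));
```

Checking the config before a crawl starts is a cheap way to catch a typo in the start URL or a zero page limit before any pages are fetched.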

Maintenance & Community

The project is maintained by Builder.io. Contributions are welcomed via pull requests.

Licensing & Compatibility

The project is licensed under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The effectiveness of the generated knowledge file is highly dependent on the quality of the CSS selector and the structure of the target website. Large websites may require careful configuration of maxPagesToCrawl, maxFileSize, and maxTokens to manage output file size.
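One way to reason about how size and token limits bound the output is to split crawled entries into several files whenever a limit would be exceeded. The sketch below is an assumption about how such limits could be applied, not a description of the tool's behavior: `splitIntoFiles` and `Limits` are hypothetical names, size is approximated by character count, and tokens are estimated with a rough four-characters-per-token heuristic rather than a real tokenizer.

```typescript
type Entry = { title: string; url: string; text: string };
type Limits = { maxFileSizeChars: number; maxTokens: number };

// Rough heuristic (assumption): about four characters per token.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

// Greedily pack entries into files. A new file is started when adding the
// next entry would exceed either limit. An entry that is too large on its
// own still gets a file to itself rather than being dropped.
function splitIntoFiles(entries: Entry[], limits: Limits): Entry[][] {
  const files: Entry[][] = [];
  let current: Entry[] = [];
  let chars = 0;
  let tokens = 0;

  for (const entry of entries) {
    const serialized = JSON.stringify(entry);
    const entryChars = serialized.length; // approximates bytes for ASCII text
    const entryTokens = estimateTokens(serialized);
    const wouldOverflow =
      chars + entryChars > limits.maxFileSizeChars ||
      tokens + entryTokens > limits.maxTokens;

    if (current.length > 0 && wouldOverflow) {
      files.push(current);
      current = [];
      chars = 0;
      tokens = 0;
    }
    current.push(entry);
    chars += entryChars;
    tokens += entryTokens;
  }
  if (current.length > 0) files.push(current);
  return files;
}

const sampleEntries: Entry[] = ["a", "b", "c"].map((id) => ({
  title: `Page ${id}`,
  url: `https://example.com/docs/${id}`,
  text: "x".repeat(100),
}));

const roomy = splitIntoFiles(sampleEntries, { maxFileSizeChars: 1_000_000, maxTokens: 1_000_000 });
const tight = splitIntoFiles(sampleEntries, { maxFileSizeChars: 1_000_000, maxTokens: 1 });
console.log(roomy.length, tight.length);
```

With generous limits all entries land in one file; with a token limit smaller than any single entry, each entry ends up in its own file.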

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 8
  • Issues (30d): 0

Star History

417 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

API service for turning websites into LLM-ready data

  • Top 1.9% on sourcepulse
  • 44k stars
  • created 1 year ago
  • updated 1 day ago