gpt-crawler by BuilderIO

CLI tool for site crawling to generate custom GPT knowledge files

Created 2 years ago

22,173 stars

Top 2.0% on SourcePulse

4 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Travis Fischer

Founder of Agentic

Pawel Garbacki

Cofounder of Fireworks AI

Dharmesh Shah

Cofounder of HubSpot

Project Summary

This project is a tool that crawls websites and generates knowledge files for creating custom GPTs or assistants. It is aimed at developers and other users who want to use content from specific websites within OpenAI's GPT ecosystem.

How It Works

The crawler runs on Node.js. It visits a specified website, extracts text content from each page using a provided CSS selector, and follows links that match a given pattern, subject to limits on the number of pages crawled and on file size. The extracted text is then compiled into a single JSON file suitable for upload to OpenAI's platform when creating a custom GPT.
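The crawl loop described above can be sketched roughly as follows. This is an illustrative toy model, not the project's implementation: pages live in an in-memory map so the example runs without network access, the CSS-selector step is assumed to have already produced each page's `text`, and the names `Page`, `CrawlRecord`, `globToRegExp`, and `crawl`, as well as the record shape, are hypothetical.

```typescript
// Illustrative sketch only; not the project's actual code.
type Page = { title: string; text: string; links: string[] };
type CrawlRecord = { title: string; url: string; text: string };

// Hypothetical helper: convert a simple glob into a RegExp.
// "*" matches within one path segment, "**" matches across segments.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // protect "**" before handling "*"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

// Breadth-first crawl: start at startUrl, follow only links that match
// the pattern, and stop once maxPagesToCrawl pages have been collected.
function crawl(
  site: Map<string, Page>,
  startUrl: string,
  match: string,
  maxPagesToCrawl: number,
): CrawlRecord[] {
  const allowed = globToRegExp(match);
  const queue: string[] = [startUrl];
  const seen = new Set<string>(queue);
  const records: CrawlRecord[] = [];

  while (queue.length > 0 && records.length < maxPagesToCrawl) {
    const url = queue.shift()!;
    const page = site.get(url);
    if (!page) continue; // link points outside the toy site
    records.push({ title: page.title, url, text: page.text });
    for (const link of page.links) {
      if (!seen.has(link) && allowed.test(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return records;
}

// A made-up four-page site used only for demonstration.
const site = new Map<string, Page>([
  ["https://example.com/docs", { title: "Docs", text: "Intro", links: ["https://example.com/docs/a", "https://example.com/blog"] }],
  ["https://example.com/docs/a", { title: "A", text: "Page A", links: ["https://example.com/docs/b"] }],
  ["https://example.com/docs/b", { title: "B", text: "Page B", links: [] }],
  ["https://example.com/blog", { title: "Blog", text: "Off-pattern", links: [] }],
]);

const records = crawl(site, "https://example.com/docs", "https://example.com/docs/**", 2);
// All records are serialized into one JSON document, as the summary describes.
const knowledgeFile = JSON.stringify(records, null, 2);
console.log(knowledgeFile);
```

In this sketch the page limit is checked before each dequeue, so a small limit stops the crawl even when matching links remain in the queue.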

Quick Start & Requirements

Install dependencies: npm i
Run crawler: npm start
Requires Node.js >= 16.
Configuration is done via config.ts.
Official docs: https://github.com/builderio/gpt-crawler
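As a rough illustration of what a configuration might look like, the sketch below declares a local type and an example object. Only maxPagesToCrawl, maxFileSize, and maxTokens are named on this page; the other field names (url, match, selector, outputFileName), the type name CrawlerConfig, the unit of maxFileSize, and all values are assumptions, so check them against the repository's own config.ts before relying on them.

```typescript
// Hypothetical local type so the snippet compiles on its own;
// the real project defines its own configuration type.
type CrawlerConfig = {
  url: string;             // assumed name: page where the crawl starts
  match: string;           // assumed name: pattern that followed links must match
  selector?: string;       // assumed name: CSS selector for the content to extract
  maxPagesToCrawl: number; // upper bound on pages visited
  outputFileName: string;  // assumed name: where the JSON knowledge file is written
  maxFileSize?: number;    // size limit for the output (unit assumed to be megabytes)
  maxTokens?: number;      // token limit for the output
};

// Example values are placeholders, not recommended settings.
const exampleConfig: CrawlerConfig = {
  url: "https://example.com/docs",
  match: "https://example.com/docs/**",
  selector: "main",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxFileSize: 1,
  maxTokens: 2000000,
};

console.log(exampleConfig);
```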

Highlighted Details

Generates knowledge files for custom GPTs from website URLs.
Configurable via config.ts with options for URL, link matching, content selectors, page limits, and output filenames.
Supports excluding specific resource types and limiting file size/token count.
Offers Docker and API server alternatives for running the crawler.

Maintenance & Community

The project is maintained by Builder.io. Contributions are welcomed via pull requests.

Licensing & Compatibility

The project is licensed under the MIT License, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The effectiveness of the generated knowledge file is highly dependent on the quality of the CSS selector and the structure of the target website. Large websites may require careful configuration of maxPagesToCrawl, maxFileSize, and maxTokens to manage output file size.
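To make the effect of a size limit concrete, here is a minimal sketch of one way crawled records could be grouped so that each output stays under a cap. It is not taken from the project: the names `Entry`, `splitBySize`, and `maxChars` are hypothetical, and serialized length in characters is used as a rough stand-in for file size in bytes.

```typescript
// Illustrative sketch only; not the project's actual splitting logic.
type Entry = { title: string; url: string; text: string };

// Group entries into batches whose compact JSON serialization stays
// within maxChars. An entry that alone exceeds the limit still gets
// its own batch, since it cannot be split further here.
function splitBySize(entries: Entry[], maxChars: number): Entry[][] {
  const batches: Entry[][] = [];
  let current: Entry[] = [];
  let currentChars = 2; // the enclosing "[" and "]"

  for (const entry of entries) {
    const entryChars = JSON.stringify(entry).length;
    const separator = current.length > 0 ? 1 : 0; // comma between elements
    if (current.length > 0 && currentChars + separator + entryChars > maxChars) {
      batches.push(current);
      current = [];
      currentChars = 2;
    }
    currentChars += entryChars + (current.length > 0 ? 1 : 0);
    current.push(entry);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Three small made-up entries and a deliberately tight limit.
const sample: Entry[] = [
  { title: "t", url: "u", text: "x" },
  { title: "t", url: "u", text: "y" },
  { title: "t", url: "u", text: "z" },
];
const batches = splitBySize(sample, 80);
batches.forEach((batch, i) => {
  console.log(`batch ${i + 1}: ${batch.length} entries, ${JSON.stringify(batch).length} chars`);
});
```

A tighter page limit reduces how much is collected in the first place, while a size or token limit of this kind only governs how the collected text is packaged.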

Health Check

Last Commit

7 months ago

Responsiveness

1 day

Star History

69 stars in the last 30 days