CLI tool for site crawling to generate custom GPT knowledge files
This project provides a tool to crawl websites and generate knowledge files for creating custom GPTs or assistants. It's designed for developers and users who want to leverage specific website content within OpenAI's GPT ecosystem, offering a streamlined way to ingest and utilize web data.
How It Works
The crawler utilizes Node.js to navigate a specified website, extracting text content based on a provided CSS selector. It follows links matching a given pattern, respecting limits on pages crawled and file size. The extracted text is then compiled into a single JSON file, optimized for upload to OpenAI's platform for custom GPT creation.
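For orientation, here is a hypothetical shape of one record in the generated JSON file. The field names are illustrative only and depend on the crawler version, so verify them against a file the crawler actually writes.

```ts
// Hypothetical shape of a single entry in the generated knowledge file.
// Field names are illustrative; check a real output file from the crawler.
interface PageRecord {
  title: string; // document title of the crawled page
  url: string;   // address the page was fetched from
  text: string;  // content extracted via the configured CSS selector
}

const example: PageRecord = {
  title: "Getting Started",
  url: "https://docs.example.com/getting-started",
  text: "Install the package, then import it in your project...",
};
```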
Quick Start & Requirements
Requires Node.js. Install dependencies with npm i, edit config.ts to point at the target site, then start the crawl with npm start.
Highlighted Details
All crawl behavior is configured in config.ts, with options for URL, link matching, content selectors, page limits, and output filenames.
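A minimal sketch of what such a configuration could look like is shown below; the interface is illustrative (the project defines its own config type), and the URLs and values are placeholders.

```ts
// Illustrative config shape; the real project defines its own in config.ts.
interface CrawlerConfig {
  url: string;              // page the crawl starts from
  match: string;            // pattern that followed links must match
  selector: string;         // CSS selector whose text is extracted
  maxPagesToCrawl: number;  // upper bound on pages visited
  outputFileName: string;   // JSON knowledge file to write
  maxFileSize?: number;     // optional cap on output file size
  maxTokens?: number;       // optional cap on tokens written
}

const config: CrawlerConfig = {
  url: "https://docs.example.com/",
  match: "https://docs.example.com/**",
  selector: "main",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

export default config;
```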
Maintenance & Community
The project is maintained by Builder.io. Contributions are welcomed via pull requests.
Licensing & Compatibility
The project is licensed under the MIT License, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The effectiveness of the generated knowledge file is highly dependent on the quality of the CSS selector and the structure of the target website. Large websites may require careful configuration of maxPagesToCrawl, maxFileSize, and maxTokens to manage output file size.
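As a rough illustration, reusing the illustrative CrawlerConfig shape from the sketch above, a large documentation site might be crawled with tighter limits; the values below are placeholders, not recommendations.

```ts
// Tighter limits for a large site; values are placeholders, not recommendations.
const largeSiteConfig: CrawlerConfig = {
  url: "https://docs.example.com/",
  match: "https://docs.example.com/**",
  selector: "main",
  maxPagesToCrawl: 500,   // stop after this many pages
  outputFileName: "output.json",
  maxFileSize: 5,         // cap output size (check units in the project docs)
  maxTokens: 500_000,     // keep the file within upload-friendly token counts
};
```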