Open list for blocking AI web crawlers
This repository provides a curated, community-driven list of AI agents and web crawlers to block, primarily for protecting websites from unauthorized data scraping for AI training. It offers configuration snippets for various web servers, enabling site owners to easily implement these blocks and safeguard their content.
How It Works
The project maintains a robots.json file where new AI bots are added. A GitHub Actions workflow automatically generates updated robots.txt, .htaccess, Nginx, and HAProxy configuration files from this source data. This approach ensures a centralized, version-controlled list that is easily deployable across different server environments, leveraging standard web server directives for bot exclusion.
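For a concrete picture, the generated robots.txt follows the standard exclusion-list pattern: one User-agent line per listed crawler, followed by a blanket Disallow. The excerpt below is illustrative only; the bot names are examples of well-known AI crawlers, and the actual file is regenerated from every entry in robots.json.

```
# Illustrative excerpt of a generated robots.txt (bot names are examples;
# the real file covers every entry in robots.json)
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```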
Quick Start & Requirements
- robots.txt: deploy the provided robots.txt file.
- Apache: place the provided .htaccess file in your Apache configuration.
- Nginx: include the nginx-block-ai-bots.conf snippet in your Nginx virtual host.
- HAProxy: copy haproxy-block-ai-bots.txt to your HAProxy config directory and include the provided ACL and http-request deny rules in your frontend (minimal Nginx and HAProxy sketches follow this list).
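The sketches below show what the Nginx and HAProxy steps can look like in practice. The file paths, the frontend/backend names, and the ACL name are assumptions for illustration; the generated nginx-block-ai-bots.conf and haproxy-block-ai-bots.txt files from this repository are the actual artifacts to drop in.

```
# Nginx sketch: pull the generated blocklist into a virtual host
# (the include path is an assumption; point it at wherever you store the file)
server {
    listen 80;
    server_name example.com;

    include /etc/nginx/nginx-block-ai-bots.conf;

    location / {
        root /var/www/html;
    }
}
```

For HAProxy, the sketch assumes haproxy-block-ai-bots.txt is a one-name-per-line list consumed with -f, so a substring match against the User-Agent header plus an early deny is enough:

```
# HAProxy sketch: deny any request whose User-Agent contains a listed name
# (frontend/backend names and the file path are assumptions)
frontend web
    bind *:80
    acl ai_bot hdr_sub(User-Agent) -i -f /etc/haproxy/haproxy-block-ai-bots.txt
    http-request deny if ai_bot
    default_backend app

backend app
    server app1 127.0.0.1:8080
```

Denying in the frontend keeps matched requests from ever reaching a backend.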
Highlighted Details
- Supported formats: robots.txt, Apache (.htaccess), Nginx, and HAProxy.
- All bot entries are maintained in a single source file, robots.json.

Maintenance & Community
Updates are managed via pull requests to robots.json. Users can subscribe to release updates via RSS/Atom feed or GitHub's "Watch" feature. The project acknowledges contributions from "Dark Visitors" and links to resources for reporting abusive crawlers.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. However, the use of robots.txt is governed by RFC 9309. The configuration files are generally compatible with their respective web servers (Apache, Nginx, HAProxy).
Limitations & Caveats
Blocking via .htaccess is noted as potentially less performant than the other methods. The README does not specify a license, which could impact commercial use or integration into closed-source projects. The list's comprehensiveness depends on community contributions.