Open list for blocking AI web crawlers
This repository provides a curated, community-driven list of AI agents and web crawlers to block, primarily for protecting websites from unauthorized data scraping for AI training. It offers configuration snippets for various web servers, enabling site owners to easily implement these blocks and safeguard their content.
How It Works
The project maintains a robots.json file where new AI bots are added. A GitHub Actions workflow automatically generates updated robots.txt, .htaccess, Nginx, and HAProxy configuration files from this source data. This approach ensures a centralized, version-controlled list that is easily deployable across different server environments, leveraging standard web server directives for bot exclusion.
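For a concrete picture, the generated robots.txt follows the standard exclusion-list pattern: one User-agent line per listed crawler, followed by a blanket Disallow. The excerpt below is illustrative only; the bot names are examples of well-known AI crawlers, and the actual file is regenerated from every entry in robots.json.

```
# Illustrative excerpt of a generated robots.txt (bot names are examples;
# the real file covers every entry in robots.json)
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```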
Quick Start & Requirements
- robots.txt: deploy the provided robots.txt file.
- Apache: place the provided .htaccess file in your Apache configuration.
- Nginx: include the nginx-block-ai-bots.conf snippet in your Nginx virtual host.
- HAProxy: copy haproxy-block-ai-bots.txt to your HAProxy config directory and include the provided ACL and http-request deny rules in your frontend (minimal Nginx and HAProxy sketches follow this list).
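The sketches below show what the Nginx and HAProxy steps can look like in practice. The file paths, the frontend/backend names, and the ACL name are assumptions for illustration; the generated nginx-block-ai-bots.conf and haproxy-block-ai-bots.txt files from this repository are the actual artifacts to drop in.

```
# Nginx sketch: pull the generated blocklist into a virtual host
# (the include path is an assumption; point it at wherever you store the file)
server {
    listen 80;
    server_name example.com;

    include /etc/nginx/nginx-block-ai-bots.conf;

    location / {
        root /var/www/html;
    }
}
```

For HAProxy, the sketch assumes haproxy-block-ai-bots.txt is a one-name-per-line list consumed with -f, so a substring match against the User-Agent header plus an early deny is enough:

```
# HAProxy sketch: deny any request whose User-Agent contains a listed name
# (frontend/backend names and the file path are assumptions)
frontend web
    bind *:80
    acl ai_bot hdr_sub(User-Agent) -i -f /etc/haproxy/haproxy-block-ai-bots.txt
    http-request deny if ai_bot
    default_backend app

backend app
    server app1 127.0.0.1:8080
```

Denying in the frontend keeps matched requests from ever reaching a backend.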
Highlighted Details
- Supported formats: robots.txt, Apache (.htaccess), Nginx, and HAProxy.
- All bot entries are maintained in a single source file, robots.json.

Maintenance & Community
Updates are managed via pull requests to robots.json. Users can subscribe to release updates via RSS/Atom feed or GitHub's "Watch" feature. The project acknowledges contributions from "Dark Visitors" and links to resources for reporting abusive crawlers.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. However, the use of robots.txt is governed by RFC 9309. The configuration files are generally compatible with their respective web servers (Apache, Nginx, HAProxy).
Limitations & Caveats
Blocking via .htaccess is noted as potentially less performant than the other methods. The README does not specify a license, which could impact commercial use or integration into closed-source projects. The list's comprehensiveness depends on community contributions.