ai.robots.txt by ai-robots-txt

Open list for blocking AI web crawlers

created 1 year ago
2,926 stars

Top 16.7% on sourcepulse

Project Summary

This repository provides a curated, community-driven list of AI agents and web crawlers to block, primarily for protecting websites from unauthorized data scraping for AI training. It offers configuration snippets for various web servers, enabling site owners to easily implement these blocks and safeguard their content.

How It Works

The project maintains a robots.json file where new AI bots are added. A GitHub Actions workflow automatically generates updated robots.txt, .htaccess, Nginx, and HAProxy configuration files from this source data. This approach ensures a centralized, version-controlled list that is easily deployable across different server environments, leveraging standard web server directives for bot exclusion.
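
For illustration only, here is roughly how the pipeline's input and output relate. The JSON field names and the bot name below are assumptions for the sake of example, not taken from the actual robots.json; the robots.txt output format, however, follows RFC 9309:

    "ExampleBot": {
      "operator": "Example AI Corp",
      "function": "Collects training data for LLMs"
    }

An entry like this would surface in the generated robots.txt as a standard exclusion rule:

    User-agent: ExampleBot
    Disallow: /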

Quick Start & Requirements

  • robots.txt: Deploy the provided robots.txt file at the root of your site.
  • Apache (.htaccess): Include the provided .htaccess file in your Apache configuration (see the first sketch after this list).
  • Nginx: Include the nginx-block-ai-bots.conf snippet in your Nginx virtual host (second sketch below).
  • HAProxy: Add haproxy-block-ai-bots.txt to your HAProxy config directory and include the provided ACL and http-request deny rules in your frontend (third sketch below).
  • Testing: Running the test suite requires Python 3.
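
A minimal sketch of what an .htaccess-based block typically looks like with Apache's mod_rewrite; the actual generated .htaccess may differ, and the bot names here are illustrative:

    # Return 403 Forbidden when the User-Agent matches a listed AI bot
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]
    RewriteRule .* - [F,L]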
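
For Nginx, the README's instruction amounts to an include inside the virtual host; the file path below is an assumption, and the snippet's internal shape (a case-insensitive User-Agent match returning 403) is a sketch of the common pattern rather than the file's confirmed contents:

    server {
        listen 80;
        server_name example.com;

        # Generated from robots.json by the project's GitHub Actions workflow
        include /etc/nginx/nginx-block-ai-bots.conf;

        location / {
            root /var/www/example.com;
        }
    }

Internally, such a snippet typically boils down to something like:

    if ($http_user_agent ~* "(GPTBot|CCBot|ClaudeBot)") {
        return 403;
    }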
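
And a hedged sketch of the HAProxy wiring: haproxy-block-ai-bots.txt is treated as a list of User-Agent substrings, loaded into an ACL that feeds an http-request deny rule (the frontend name, backend, and file path are assumptions):

    frontend web
        mode http
        bind :80
        # Deny requests whose User-Agent matches any entry in the generated list
        acl ai_bot hdr_sub(user-agent) -i -f /etc/haproxy/haproxy-block-ai-bots.txt
        http-request deny if ai_bot
        default_backend app_servers

    backend app_servers
        mode http
        server app1 127.0.0.1:8080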

Highlighted Details

  • Community-driven list of AI crawlers.
  • Configuration files for robots.txt, Apache (.htaccess), Nginx, and HAProxy.
  • Automated generation of configuration files via GitHub Actions.
  • Encourages contributions via pull requests to robots.json.

Maintenance & Community

Updates are managed via pull requests to robots.json. Users can subscribe to release updates via RSS/Atom feed or GitHub's "Watch" feature. The project acknowledges contributions from "Dark Visitors" and links to resources for reporting abusive crawlers.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. The robots.txt mechanism itself is specified by RFC 9309 (the Robots Exclusion Protocol), and the configuration files are generally compatible with their respective web servers (Apache, Nginx, HAProxy).

Limitations & Caveats

robots.txt is purely advisory: it relies on crawlers choosing to honor it, which is why the project also ships enforceable server-level blocks for Apache, Nginx, and HAProxy. Among those, .htaccess blocking is noted as potentially less performant than the alternatives, since Apache re-evaluates per-directory .htaccess files on every request. The README does not specify a license, which could complicate commercial use or integration into closed-source projects. Finally, the list's comprehensiveness depends on community contributions, so newly launched crawlers may not be covered until someone submits them.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 4
  • Star history: 413 stars in the last 90 days

Explore Similar Projects

firecrawl by mendableai (Top 1.9%, 44k stars)
  API service for turning websites into LLM-ready data
  Created 1 year ago, updated 1 day ago
  Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.