mcp-crawl4ai-rag by coleam00

Web crawling and RAG for AI agents

Created 8 months ago

1,954 stars

Top 22.3% on SourcePulse

Project Summary

This project provides a Model Context Protocol (MCP) server that lets AI agents and coding assistants perform web crawling and Retrieval-Augmented Generation (RAG). It addresses the need for AI systems to access and use external, up-to-date information by scraping websites and storing the content in a vector database. The main benefit is access to dynamic knowledge, with advanced RAG strategies for improved retrieval accuracy and specialized tools for code analysis.

How It Works

The system runs as an MCP server and uses web crawling tools to scrape specified URLs. Content is processed, chunked, and stored in a Supabase vector database using embeddings (OpenAI by default). It supports several advanced RAG strategies: contextual embeddings, hybrid search, agentic RAG for code examples, and result reranking. An optional Neo4j knowledge graph component parses GitHub repositories to analyze code structure and detect hallucinations in AI-generated code.
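The chunk, embed, and retrieve flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation: the function names (`chunk_text`, `embed`, `hybrid_search`), the 50/50 score weighting, and the bag-of-words "embedding" are all assumptions made so the example runs without Supabase or an OpenAI key. A real deployment would replace `embed` with an embedding model and the loop with a vector database query.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split a paragraph that is itself longer than max_chars.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_search(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank chunks by an equal blend of vector similarity and keyword overlap."""
    q_vec = embed(query)
    q_terms = set(tokenize(query))
    scored = []
    for chunk in chunks:
        semantic = cosine(q_vec, embed(chunk))
        overlap = len(q_terms & set(tokenize(chunk)))
        keyword = overlap / len(q_terms) if q_terms else 0.0
        scored.append((0.5 * semantic + 0.5 * keyword, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    page = "Supabase stores vectors with pgvector.\n\nNeo4j stores the knowledge graph."
    for score, chunk in hybrid_search("pgvector vectors", chunk_text(page, 60)):
        print(f"{score:.2f}  {chunk}")
```

Blending the two scores is one simple way to combine semantic and keyword signals; a reranking step, if enabled, would reorder the returned `top_k` chunks with a stronger model.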

Quick Start & Requirements

Primary Install: Docker (recommended) or direct Python installation using uv.
- Docker: clone the repository (git clone), build the image (docker build), then create a .env file.
- Python: clone the repository (git clone), install uv (pip install uv), create and activate a virtual environment (uv venv), run uv pip install -e . and crawl4ai-setup, then create a .env file.
Prerequisites: Python 3.12+, Supabase (with the pgvector extension), and an OpenAI API key. Neo4j is optional and only needed for the knowledge graph features.
Links:
- Repository: https://github.com/coleam00/mcp-crawl4ai-rag
- Local AI Package (for Neo4j): https://github.com/coleam00/local-ai-packaged.git
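
Before starting the server, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch of such a preflight check. The environment variable names (`OPENAI_API_KEY`, `SUPABASE_URL`, `SUPABASE_SERVICE_KEY`, and the `NEO4J_*` group) are assumptions for illustration and are not taken from this summary; the repository's own .env template is the authoritative list.

```python
import os
import sys
from collections.abc import Mapping

# Assumed setting names, for illustration only. Check the project's .env template.
REQUIRED_VARS = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY")
OPTIONAL_NEO4J_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def check_environment(
    env: Mapping[str, str],
    python_version: tuple[int, int] = sys.version_info[:2],
) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(python_version) < (3, 12):
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python 3.12+ is required, found {found}")
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"missing required setting: {name}")
    # Neo4j is optional, but a partial configuration is probably a mistake.
    neo4j_set = [name for name in OPTIONAL_NEO4J_VARS if env.get(name)]
    if neo4j_set and len(neo4j_set) < len(OPTIONAL_NEO4J_VARS):
        missing = [name for name in OPTIONAL_NEO4J_VARS if not env.get(name)]
        problems.append("incomplete Neo4j settings, missing: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a user fix every missing setting in a single pass.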

Highlighted Details

- Advanced RAG Strategies: Configurable options include Contextual Embeddings, Hybrid Search, Agentic RAG (code example extraction), and Reranking.
- Knowledge Graph Integration: Parses GitHub repositories into Neo4j for AI hallucination detection and code analysis, offering tools like parse_github_repository and check_ai_script_hallucinations.
- Specialized Tools: Provides crawl_single_page, smart_crawl_url, perform_rag_query, and search_code_examples for targeted data retrieval.
- MCP Compatibility: Designed for integration with various MCP clients, supporting SSE and stdio transports.
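
To make the hallucination-detection idea concrete, here is a small sketch of how a script could be checked against a table of known classes and methods. It is not the project's check_ai_script_hallucinations implementation: the in-memory `KNOWN_API` dictionary stands in for the Neo4j graph, and the `crawler` module, `AsyncCrawler` class, and `find_unknown_symbols` function are hypothetical names invented for the example. It only handles `from module import Class` imports and direct `variable.method()` calls.

```python
import ast

# Hypothetical stand-in for a parsed repository graph: module -> class -> methods.
KNOWN_API = {"crawler": {"AsyncCrawler": {"run", "close"}}}


def find_unknown_symbols(source: str, known_api: dict[str, dict[str, set[str]]]) -> list[str]:
    """Report imported classes and called methods that the known API lacks."""
    tree = ast.parse(source)
    issues: list[str] = []

    # Pass 1: map imported class names (or aliases) to their known methods.
    classes: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module in known_api:
            for alias in node.names:
                methods = known_api[node.module].get(alias.name)
                if methods is None:
                    issues.append(
                        f"line {node.lineno}: {node.module}.{alias.name} is not a known class"
                    )
                else:
                    classes[alias.asname or alias.name] = methods

    # Pass 2: record variables assigned from a known class constructor.
    instances: dict[str, str] = {}
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id in classes
        ):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    instances[target.id] = node.value.func.id

    # Pass 3: flag method calls on those variables that the class does not define.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in instances
        ):
            cls = instances[node.func.value.id]
            if node.func.attr not in classes[cls]:
                issues.append(
                    f"line {node.lineno}: {cls}.{node.func.attr}() is not a known method"
                )
    return issues


if __name__ == "__main__":
    script = "\n".join(
        [
            "from crawler import AsyncCrawler",
            "c = AsyncCrawler()",
            "c.summarize()",
        ]
    )
    for issue in find_unknown_symbols(script, KNOWN_API):
        print(issue)
```

Static checks like this only catch names that do not exist; they say nothing about whether an existing method is used correctly.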

Maintenance & Community

The author describes the project as a "testbed" and "first version," with plans for significant improvements and integration into a larger "Archon V2" project. The author is not currently addressing issues and pull requests but intends to do so. No community links (Discord, Slack, etc.) are provided in the README.

Licensing & Compatibility

The specific open-source license for this repository is not stated in the provided README content.

Limitations & Caveats

The project is in an early, experimental ("testbed") stage, with ongoing development planned. The knowledge graph functionality is not fully compatible with Docker, so running the server directly with Python is recommended when using it. Agentic RAG and repository parsing may be slow because they rely on LLM calls or analysis of large codebases. Active issue and PR management is currently limited.
