Web crawling and RAG for AI agents
Top 25.3% on SourcePulse
This project provides a Model Context Protocol (MCP) server that lets AI agents and coding assistants perform web crawling and Retrieval-Augmented Generation (RAG). It addresses the need for AI systems to access up-to-date external information by scraping websites and storing the content in a vector database. The main benefit is dynamic knowledge for AI agents, with advanced RAG strategies for better retrieval accuracy and specialized tools for code analysis.
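To make the MCP angle concrete, here is a minimal sketch of the JSON-RPC 2.0 message an MCP client sends to invoke a server tool. The `tools/call` method and the `name`/`arguments` parameters follow the MCP specification, and the tool name `perform_rag_query` is one of the tools this page lists under Highlighted Details. The `query` argument and the helper `build_tool_call` are hypothetical, chosen for illustration; the project's real argument names may differ.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "query" is an assumed argument name, not taken from the project's README.
    print(build_tool_call(1, "perform_rag_query", {"query": "how is hybrid search configured?"}))
```

In practice an MCP client library or the coding assistant builds and transports this message; the sketch only shows its shape.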
How It Works
The system functions as an MCP server, leveraging web crawling tools to scrape specified URLs. Content is processed, chunked, and stored in a Supabase vector database using embeddings (defaulting to OpenAI). It supports multiple advanced RAG strategies, including contextual embeddings, hybrid search, agentic RAG for code examples, and result reranking. An optional Neo4j knowledge graph component allows for parsing GitHub repositories to analyze code structure and detect AI-generated code hallucinations.
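Two of the steps above, chunking and hybrid search, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: `chunk_text` uses fixed-size character windows with overlap, and `reciprocal_rank_fusion` merges a vector-search ranking with a keyword-search ranking using reciprocal rank fusion, a common merging technique that is swapped in here because the page does not say how the project combines results. All function names, parameters, and defaults are hypothetical.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in; ties are
    broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
    vector_hits = ["a", "b", "c"]   # hypothetical IDs from embedding search
    keyword_hits = ["b", "d"]       # hypothetical IDs from keyword search
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

A document that appears in both rankings accumulates score from each, so it tends to rise to the top, which is the intuition behind combining vector and keyword results.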
Quick Start & Requirements
Install with Docker or uv.
Docker: git clone, docker build, create .env.
uv: git clone, pip install uv, uv venv, activate, uv pip install -e ., crawl4ai-setup, create .env.
Prerequisites include Supabase (with the pgvector extension) and an OpenAI API key. Neo4j is optional for knowledge graph features.
Repository: https://github.com/coleam00/mcp-crawl4ai-rag
Local AI Package (for Neo4j): https://github.com/coleam00/local-ai-packaged.git
Highlighted Details
Knowledge graph tools: parse_github_repository and check_ai_script_hallucinations.
Crawling and retrieval tools: crawl_single_page, smart_crawl_url, perform_rag_query, and search_code_examples for targeted data retrieval.
Maintenance & Community
The project is described as a "testbed" and "first version," with plans for significant future improvements and integration into a larger "Archon V2" project. Currently, the author is not actively addressing issues and pull requests but intends to do so. No community links (Discord, Slack, etc.) are provided in the README.
Licensing & Compatibility
The specific open-source license for this repository is not stated in the provided README content.
Limitations & Caveats
The project is in an early, experimental ("testbed") stage, with further development planned. The knowledge graph functionality is noted as not fully compatible with Docker; running the server directly with Python is recommended when using it. Agentic RAG and repository parsing may be slow because of LLM calls and large-codebase analysis. Active issue and PR management is currently limited.