knowhere by Ontos-AI

Document memory infrastructure for AI agents

Created 3 weeks ago

552 stars

Top 57.4% on SourcePulse

Project Summary

Knowhere provides a memory layer for AI agents and RAG systems, transforming unstructured documents into persistent, navigable context. It addresses the challenge of preparing diverse data formats (PDFs, Office, images, text) for AI by parsing, extracting hierarchical structure, and constructing knowledge graphs. This enables more accurate, efficient, and semantically rich retrieval for LLM workflows, benefiting developers and researchers building advanced AI applications.

How It Works

Knowhere operates in two steps: parsing documents to build memory, then letting agents retrieve from it. The "Parse and Build Memory" step uses specialized parsers for each file type. Its proprietary tree-like algorithm reconstructs the full document hierarchy to prevent semantic fragmentation, then stores chunks, navigation trees, summaries, and graph links. The "Agentic Retrieval" step fuses multiple signals (keyword, path, semantic) and lets agents navigate each document's section tree and the cross-document graph, drilling into relevant regions to return traceable, contextualized evidence. Unlike a traditional flat vector lookup, which returns isolated chunks, this approach mimics how a human reader moves through a document's structure to reach the relevant passage.
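To make the idea of a section tree with traceable paths concrete, here is a minimal sketch. It is not Knowhere's algorithm or API: the Section class, the iter_chunks and find helpers, and the sample report are all hypothetical names invented for illustration, and a plain keyword match stands in for the real retrieval signals.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node in a hypothetical document section tree (not Knowhere's schema)."""
    title: str
    chunks: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def iter_chunks(node, path=()):
    """Yield (path, chunk) pairs, where path is the chain of section titles."""
    here = path + (node.title,)
    for chunk in node.chunks:
        yield here, chunk
    for child in node.children:
        yield from iter_chunks(child, here)


def find(node, keyword):
    """Return every chunk containing the keyword, together with its section path."""
    return [
        (" > ".join(path), chunk)
        for path, chunk in iter_chunks(node)
        if keyword.lower() in chunk.lower()
    ]


# Hypothetical sample document, made up for this sketch.
doc = Section(
    "Annual Report",
    children=[
        Section(
            "Finance",
            chunks=["Revenue grew in Q4."],
            children=[Section("Risks", chunks=["Currency exposure affects revenue."])],
        ),
        Section("Operations", chunks=["Two new warehouses opened."]),
    ],
)

hits = find(doc, "revenue")
for path, chunk in hits:
    print(f"{path}: {chunk}")
```

Because each hit carries its full path, a caller can tell that the second match sits under Finance > Risks rather than treating it as an isolated fragment, which is the property a flat chunk list loses.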

Quick Start & Requirements

  • Primary install/run: Sync dependencies with uv (uv sync --all-packages), copy the example environment file (cp apps/api/.env.example apps/api/.env), start the local development stack (./deploy/local-dev/start-dev.sh), then run the API (cd apps/api && uv run main.py) and the worker (cd apps/worker && uv run worker.py) in separate terminals.
  • Prerequisites: Python 3.11+, uv, Docker with Docker Compose. Requires API keys for LLM providers (e.g., DeepSeek, Qwen, OpenAI), S3-compatible storage credentials, and optionally a MinerU API key for PDF parsing. A vision-capable model provider is needed for image summaries and OCR.
  • Resource Footprint: Local development requires running the API, worker, PostgreSQL database, and Redis.
  • Links: Website 🔗, Docs 📄, Self-Host 🏠, Dashboard Overview 🖥️. Cloud API available at knowhereto.ai.
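Before running the setup commands above, it can help to confirm the listed prerequisites are present. The following is a minimal preflight sketch, not part of the project: the preflight function and its parameters are hypothetical, and it only checks the Python version and whether uv and docker are on the PATH. It does not verify Docker Compose, API keys, or storage credentials.

```python
import shutil
import sys

# Prerequisites taken from the list above; the check itself is an illustration.
REQUIRED_TOOLS = ("uv", "docker")
MIN_PYTHON = (3, 11)


def preflight(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "Prerequisites look fine.")
```

The version and which parameters exist only so the check can be exercised without touching the real environment.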

Highlighted Details

  • Performance Benchmark: Agents using Knowhere show +36% first-try accuracy and +10% recall over raw documents, and reach 79% accuracy with feedback, compared to a ~53% ceiling on raw documents.
  • Multi-modal Parsing: High-fidelity extraction from PDFs, Office documents, and images, preserving hierarchical paths and linking multi-modal assets back to their source chunks.
  • Agentic RAG: A hybrid retrieval engine combining traditional search results, merged via reciprocal rank fusion (RRF), with autonomous agent navigation through structured document graphs.
  • Evidence-based Citations: Every retrieval result is backed by traceable source paths, including document, section, and chunk information.
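The RRF step mentioned above can be illustrated with the standard reciprocal rank fusion formula, where each item scores the sum of 1 / (k + rank) across the ranked lists it appears in. This is a generic sketch, not Knowhere's implementation: the function name, the chunk ids, and the constant k = 60 (a commonly used default) are assumptions, and the three input lists simply stand in for the keyword, path, and semantic signals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by id for determinism.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical chunk ids returned by three separate retrieval signals.
keyword_hits = ["c1", "c2", "c3"]
path_hits = ["c2", "c4"]
semantic_hits = ["c2", "c1", "c5"]

fused = reciprocal_rank_fusion([keyword_hits, path_hits, semantic_hits])
print(fused)
```

Here c2 comes out on top because it appears in all three lists, even though it is not first in the keyword list. RRF rewards agreement between signals without needing their raw scores to be comparable.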

Maintenance & Community

Knowhere was open-sourced on May 7, 2026. Communication channels include GitHub Discussions for general conversation and GitHub Issues for bug reports and feature requests. A Contribution Guide is available, encouraging community involvement.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license. This license is permissive and generally compatible with commercial use and linking within closed-source applications.

Limitations & Caveats

Support for formats such as .epub, .html, .xml, .mp4, and .mp3 is listed as "Coming Soon". The project is actively expanding benchmarks and adding parsers, indicating ongoing development.

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 77
  • Issues (30d): 31
  • Star History: 562 stars in the last 27 days
