codemogger  by glommer

Local code indexing and search for AI coding agents

Created 2 weeks ago

260 stars

Top 97.6% on SourcePulse

Project Summary

Codemogger is a local, self-contained code indexing library and MCP server for AI coding agents. It parses source files with tree-sitter, splits them into semantic chunks (logical units such as functions and classes), and stores the chunks together with locally computed embeddings in a single SQLite database. AI agents can then run fast, precise keyword searches and natural-language semantic queries over a codebase without external servers or API keys.

How It Works

The system scans a codebase, respecting .gitignore, and uses tree-sitter (WASM) to build Abstract Syntax Trees (ASTs), from which it extracts definitions such as functions, structs, and classes as semantic chunks. Each chunk is encoded with a local embedding model (all-MiniLM-L6-v2 by default) and stored in an embedded SQLite database. The database combines Full-Text Search (FTS) for keyword matching with vector search for semantic similarity. Indexing is incremental: only files whose SHA-256 hash has changed are re-processed.
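The hash-based incremental step can be sketched as follows. This is a minimal illustration, not Codemogger's actual code: the names HashIndex, sha256Hex, and planReindex are hypothetical, and file contents are passed in as strings rather than read from disk so the example stays self-contained.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: file path -> SHA-256 hex digest recorded at the last index run.
type HashIndex = Map<string, string>;

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Compare current file contents against the stored digests and report
// which files need re-processing and which were deleted.
function planReindex(
  stored: HashIndex,
  current: Map<string, string>, // file path -> file content
): { changed: string[]; removed: string[]; next: HashIndex } {
  const changed: string[] = [];
  const removed: string[] = [];
  const next: HashIndex = new Map();

  current.forEach((content, path) => {
    const digest = sha256Hex(content);
    next.set(path, digest);
    // A missing or different stored digest means the file is new or modified.
    if (stored.get(path) !== digest) changed.push(path);
  });

  stored.forEach((_digest, path) => {
    if (!current.has(path)) removed.push(path);
  });

  return { changed, removed, next };
}
```

Only the paths in changed would need to be re-parsed, re-chunked, and re-embedded; unchanged files keep their existing rows.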

Quick Start & Requirements

  • Installation: Install globally with npm install -g codemogger, or run it via npx.
  • Prerequisites: Node.js/npm. Supports 13 languages including Rust, C/C++, Go, Python, Zig, Java, Scala, JavaScript/TypeScript, PHP, and Ruby.
  • Usage: Index a project with codemogger index ./my-project and search using codemogger search "query". It can be integrated as an MCP server via a JSON configuration.
  • SDK: A TypeScript SDK is available, allowing users to provide their own embedding functions.
  • Links: No external documentation or demo links are provided in the README.

Highlighted Details

  • Performance: Keyword search is significantly faster (25x-370x) than ripgrep and yields precise definitions. Semantic search excels at finding relevant code via natural language queries, outperforming keyword-based tools when exact terms are unknown.
  • Efficiency: Utilizes int8 quantized embeddings (395 bytes/chunk) and a quantized embedding model for reduced storage and faster local CPU processing.
  • Single DB: Consolidates multiple codebases into a single SQLite file, simplifying management and deployment.
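To illustrate why int8 quantization keeps storage small, here is a minimal sketch of one common scheme: symmetric per-vector quantization, where each float is scaled into the range -127..127 and stored as one byte. The scheme, the QuantizedVector type, and the function names are assumptions for illustration; the summary does not say which quantization method Codemogger uses. all-MiniLM-L6-v2 produces 384-dimensional vectors, so the values alone take 384 bytes at one byte per dimension.

```typescript
// Assumed scheme: symmetric per-vector int8 quantization (scale = max |x| / 127).
interface QuantizedVector {
  values: Int8Array; // one byte per dimension
  scale: number;     // multiply by this to approximate the original floats
}

function quantize(vec: Float32Array): QuantizedVector {
  let maxAbs = 0;
  for (let i = 0; i < vec.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(vec[i]));
  }
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const values = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) {
    values[i] = Math.round(vec[i] / scale); // always within -127..127
  }
  return { values, scale };
}

// Cosine similarity computed directly on the int8 values.
// The per-vector scales cancel out, so they are not needed here.
function cosineInt8(a: QuantizedVector, b: QuantizedVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.values.length; i++) {
    dot += a.values[i] * b.values[i];
    normA += a.values[i] * a.values[i];
    normB += b.values[i] * b.values[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}
```

Compared with 32-bit floats (1,536 bytes for 384 dimensions), this layout is roughly four times smaller, at the cost of a small rounding error in similarity scores.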

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive MIT license supports commercial use and integration into closed-source applications.

Limitations & Caveats

The README does not detail specific limitations, alpha status, or known bugs. Performance benchmarks were measured on an Apple M2 (8GB) and may differ on other hardware. Chunking quality depends on the tree-sitter grammars for the supported languages.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 2
  • Star History: 262 stars in the last 17 days
