chunkhound by chunkhound

Deep code and file research engine for AI assistants

Created 6 months ago
268 stars

Top 95.8% on SourcePulse

Project Summary

This project addresses the challenge of making codebases deeply searchable for AI assistants by transforming them into knowledge bases. It targets engineers, researchers, and power users who need to understand complex code relationships and discover features semantically. ChunkHound offers a local-first, privacy-preserving solution that enhances AI-assisted code research and development.

How It Works

ChunkHound leverages the research-backed cAST (Chunking via Abstract Syntax Trees) algorithm for semantic code chunking, preserving code meaning through structure-aware parsing. It employs Multi-Hop Semantic Search to uncover interconnected code relationships beyond simple keyword matches, enabling natural language queries like "find authentication code" to discover related components. The system operates on a local-first architecture, ensuring code privacy and enabling offline use with local models. It supports structured parsing for 29 languages via Tree-sitter and custom parsers, providing consistent semantic understanding across diverse codebases.
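
The chunking idea can be illustrated with a short, self-contained sketch. The example below is not ChunkHound's cAST implementation (which uses Tree-sitter and covers 29 languages); it only shows the underlying principle of splitting code at syntax-tree boundaries instead of fixed-size text windows, using Python's standard ast module.

    import ast

    def chunk_python_source(source: str) -> list[str]:
        """Illustrative only: one chunk per top-level function or class.

        ChunkHound's actual cAST chunking (Tree-sitter based, 29 languages)
        is more sophisticated than this single-file sketch.
        """
        tree = ast.parse(source)
        lines = source.splitlines()
        chunks = []
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # lineno/end_lineno give the node's span (Python 3.8+),
                # so each chunk remains a syntactically complete unit.
                chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
        return chunks

    sample = "def login(user):\n    return verify(user)\n\nclass Session:\n    pass\n"
    for chunk in chunk_python_source(sample):
        print("--- chunk ---")
        print(chunk)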

Quick Start & Requirements

  • Installation: Requires Python 3.10+ and the uv package manager. Install uv via curl -LsSf https://astral.sh/uv/install.sh | sh, then install ChunkHound with uv tool install chunkhound.
  • Prerequisites: Semantic search requires an embedding provider (e.g., OpenAI, VoyageAI, or a local Ollama instance); regex search works without any API key.
  • Setup: Create a .chunkhound.json configuration file specifying, for example, the embedding provider and API key (a sketch follows this list), then index your codebase with chunkhound index.
  • Documentation: Comprehensive guides are available at chunkhound.github.io.
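
Putting the quick-start steps together, a minimal setup sketch might look like the following. The exact .chunkhound.json schema is not spelled out in this summary, so the key names below (embedding provider and API key) are illustrative assumptions; consult chunkhound.github.io for the authoritative layout. Only the chunkhound index command is documented above.

    # Illustrative setup sketch -- the config keys are assumptions, not a
    # confirmed schema; see chunkhound.github.io for the real layout.
    import json
    import subprocess

    config = {
        "embedding": {
            "provider": "openai",   # could also be VoyageAI or a local Ollama model
            "api_key": "sk-...",    # skip API keys entirely to use regex-only search
        }
    }

    with open(".chunkhound.json", "w") as fh:
        json.dump(config, fh, indent=2)

    # Run the documented indexing command over the codebase.
    subprocess.run(["chunkhound", "index"], check=True)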

Highlighted Details

  • Research Foundation: Built on the cAST algorithm, which demonstrates significant gains in retrieval recall and on code-generation benchmarks (RepoEval, SWE-bench).
  • Local-First Architecture: Code remains on the user's machine, supporting offline use with Ollama and avoiding per-token costs for local models.
  • Universal Language Support: Structured parsing for 29 languages, including common programming languages and configuration files.
  • Intelligent Code Discovery: Multi-hop search, automatic feature pattern discovery (e.g., finding "authentication" yields related code), and convergence detection (see the conceptual sketch after this list).
  • Real-Time Indexing: Features automatic file watching, efficient updates via smart content diffs, seamless Git branch switching, and live memory systems for documentation.
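
The multi-hop idea above can be sketched conceptually: results from one search hop seed the queries for the next hop, and the loop stops once no new chunks are discovered (convergence). This is a generic illustration that assumes chunk embeddings are already available as a NumPy array; it is not ChunkHound's actual search or scoring logic.

    # Conceptual multi-hop search sketch -- not ChunkHound's implementation.
    import numpy as np

    def top_k(query: np.ndarray, embeddings: np.ndarray, k: int) -> set[int]:
        """Indices of the k chunk embeddings most cosine-similar to the query."""
        sims = embeddings @ query / (
            np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-9
        )
        return set(np.argsort(-sims)[:k])

    def multi_hop_search(query_vec, chunk_vecs, k=5, max_hops=3) -> set[int]:
        found: set[int] = set()
        frontier = [query_vec]
        for _ in range(max_hops):
            hits = set()
            for q in frontier:
                hits |= top_k(q, chunk_vecs, k)
            new = hits - found
            if not new:                                # convergence: nothing new found
                break
            frontier = [chunk_vecs[i] for i in new]    # hop again from the new results
            found |= new
        return found

In practice, a query such as "find authentication code" would be embedded first, and each returned chunk's embedding would drive the next hop until the result set stops growing.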

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The MIT license is permissive, generally allowing for commercial use and integration within closed-source projects.

Limitations & Caveats

Advanced semantic search capabilities require configuration with external API keys or a local Ollama setup. The README details complex exclusion and workspace overlay configurations that may require careful tuning. While benchmarks are cited, real-world performance may vary based on codebase size and complexity.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 27
  • Issues (30d): 9
  • Star History: 96 stars in the last 30 days
