semble by MinishLab

Fast, accurate code search for AI agents

Created 2 months ago

5,503 stars

Top 9.1% on SourcePulse

1 Expert Loves This Project

Tim Suchanek

Founder of expand.ai

Project Summary

Semble: Fast and Accurate Code Search for Agents

Semble is a code search library built for AI agents. It is designed to cut token consumption and latency compared with the traditional grep+read workflow by returning precise code snippets instead of whole files, which speeds up code understanding and generation. Because it indexes and searches an entire codebase in under a second, Semble aims to be a foundational tool for agent-based software development.

How It Works

Semble uses a hybrid retrieval strategy that combines semantic and lexical search. It first splits code files into chunks using the Chonkie library. Each query is then scored against these chunks in two complementary ways: static Model2Vec embeddings, derived from a code-specialized model, capture semantic similarity, while BM25 handles lexical matching of identifiers and API names. The two rankings are merged with Reciprocal Rank Fusion (RRF). The fused results are then refined by a set of code-aware ranking signals: adaptive weighting based on query type (natural language vs. symbol-like), boosting definitions of queried symbols, matching identifier stems, promoting file coherence, and penalizing noise from test or example files. Because no transformer forward pass is needed at query time, the whole pipeline runs on CPU while remaining both fast and accurate.
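The fusion step can be illustrated with a minimal sketch of generic Reciprocal Rank Fusion. This is not Semble's actual code: the function name, the chunk identifiers, the constant k=60 (a conventional default for RRF), and the optional per-ranking weights (a simple stand-in for the adaptive weighting described above) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several best-first ranked lists of chunk ids into one list.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant; 60 is a conventional default, assumed here.
    weights: optional per-ranking weights (hypothetical stand-in for
             query-dependent weighting of semantic vs. lexical results).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) to the chunk's score.
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical chunk ids, as if returned by an embedding ranker and by BM25.
semantic = ["auth.py:10", "session.py:42", "utils.py:7"]
lexical = ["session.py:42", "tokens.py:3", "auth.py:10"]

fused = reciprocal_rank_fusion([semantic, lexical])
semantic_heavy = reciprocal_rank_fusion([semantic, lexical], weights=[2.0, 1.0])
print(fused)
print(semantic_heavy)
```

RRF only looks at rank positions, so the embedding similarities and BM25 scores never need to be normalized onto a common scale before they are combined.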

Quick Start & Requirements

Primary install: pip install semble or uv add semble.
Prerequisites: Runs on CPU; no API keys, GPU, or external services are required. uv is recommended for MCP server setup.
Usage: Available via a Python API (SembleIndex.from_path, SembleIndex.from_git, index.search, index.find_related) and a standalone CLI (semble search, semble find-related).
MCP Server: Can be configured as an MCP server for agents like Claude Code, Cursor, Codex, and OpenCode, supporting on-demand cloning and indexing of repositories. Configuration examples are provided in the README.

Highlighted Details

Performance: Indexes a repository in ~250 ms and answers queries in ~1.5 ms, all on CPU.
Accuracy: Achieves an NDCG@10 score of 0.854, rivaling code-specialized transformer models.
Token Efficiency: Utilizes ~98% fewer tokens than grep+read by returning only relevant code chunks.
Agent Integration: Functions as a drop-in MCP server, seamlessly integrating with various AI coding assistants.
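For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how NDCG@k is commonly computed from graded relevance labels. It does not reproduce Semble's benchmark: the function names and the example labels are assumptions, and the ideal ordering is taken from the retrieved list itself, which is a simplification of the usual definition over all relevant items.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (best-first order)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant results at all
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical labels: 1 = relevant chunk, 0 = irrelevant chunk.
perfect = ndcg_at_k([1, 0, 0])      # relevant chunk ranked first
second_place = ndcg_at_k([0, 1])    # relevant chunk ranked second
print(perfect, round(second_place, 3))
```

A score of 1.0 means the relevant chunks were ranked in the best possible order; pushing a relevant chunk down the list lowers the score logarithmically.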

Maintenance & Community

The provided README does not detail specific maintenance schedules, notable contributors, sponsorships, or community channels (e.g., Discord, Slack). The project is authored by Thomas van Dongen and Stephan Tulkens.

Licensing & Compatibility

Semble is released under the MIT license. This permissive license allows for broad compatibility with commercial use and integration into closed-source projects.

Limitations & Caveats

While Semble offers superior speed and token efficiency for agent-based code search, the README suggests that traditional grep remains preferable for exhaustive literal string matching or quick confirmation of exact text. Additionally, Semble actively down-ranks code found in test files, compatibility shims, example directories, and declaration stubs, which may be a limitation if these specific code types are the primary search target.
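To make the down-ranking caveat concrete, here is a purely hypothetical sketch of a path-based penalty. The directory names, the filename rules, the multiplicative penalty of 0.5, and both function names are assumptions for illustration; the source does not describe how Semble detects or weights these files.

```python
from pathlib import PurePosixPath

# Hypothetical directory names treated as "noisy" for ranking purposes.
NOISY_DIRS = {"test", "tests", "example", "examples"}


def is_noisy_path(path):
    """Return True for paths that look like tests, examples, or stubs."""
    p = PurePosixPath(path)
    if any(part.lower() in NOISY_DIRS for part in p.parts[:-1]):
        return True
    name = p.name.lower()
    # test_*.py files and .pyi declaration stubs (assumed rules).
    return name.startswith("test_") or name.endswith(".pyi")


def apply_noise_penalty(scored, penalty=0.5):
    """Scale scores of noisy paths by `penalty`, then re-sort best-first."""
    adjusted = [
        (path, score * penalty if is_noisy_path(path) else score)
        for path, score in scored
    ]
    return sorted(adjusted, key=lambda item: -item[1])


# Hypothetical (path, score) pairs from an earlier ranking stage.
results = apply_noise_penalty([("tests/test_auth.py", 0.9), ("src/auth.py", 0.8)])
print(results)
```

Under a scheme like this, a test file has to outscore an implementation file by a wide margin to rank above it, which is why searches that deliberately target tests or examples may surface less than expected.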

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

662 stars in the last 30 days