vectordb by kagisearch

Python package for local, embeddings-based text retrieval

Created 2 years ago

776 stars

Top 45.2% on SourcePulse

Project Summary

vectordb is a minimal Python package for local, end-to-end text retrieval using embeddings and vector search. It is designed for low latency and a small memory footprint, and it powers AI features in Kagi Search. It is aimed at developers who need efficient, self-contained semantic search.

How It Works

VectorDB stores text content and automatically splits long documents into chunks. Optional metadata can be attached to each chunk, and a configurable embedding model (e.g., BAAI, Universal Sentence Encoder, or a custom HuggingFace model) generates the vector representations. Retrieval is a semantic search: the query is embedded and the most relevant chunks are returned. For speed, it uses Faiss for smaller datasets and mrpt for larger ones.
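The store-embed-search flow described above can be illustrated with a small, self-contained sketch. This is not the package's implementation: `ToyMemory`, `embed`, and `cosine` are hypothetical names, the embedding is a plain bag-of-words count rather than a neural model, and the search is a brute-force scan rather than a Faiss or mrpt index.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: sparse bag-of-words counts (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """Hypothetical in-memory store of (chunk, metadata, vector) triples."""

    def __init__(self):
        self.entries = []

    def save(self, chunks, metadata=None):
        # Pair each chunk with its metadata (empty dict if none is given).
        metadata = metadata or [{} for _ in chunks]
        for chunk, meta in zip(chunks, metadata):
            self.entries.append((chunk, meta, embed(chunk)))

    def search(self, query, top_n=1):
        # Embed the query, then rank every stored chunk by similarity.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [{"chunk": c, "metadata": m} for c, m, _ in ranked[:top_n]]


memory = ToyMemory()
memory.save(
    ["apples are green", "oranges are orange"],
    [{"source": "a.txt"}, {"source": "b.txt"}],
)
results = memory.search("green apples", top_n=1)
```

A real embedding model would also match queries that share meaning but no words with a chunk, which this word-count stand-in cannot do.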

Quick Start & Requirements

Install via pip: pip install vectordb2
Requirements: Python. GPU acceleration is supported but not strictly required.
Usage examples and detailed documentation are available in the README.

Highlighted Details

Supports multiple embedding models, including options for "fast", "normal", "best", and multilingual, plus custom HuggingFace models.
Offers configurable chunking strategies (sliding window with overlap, or paragraph-based).
Includes options for persistent storage (memory_file) and controlling search result diversity (batch_results).
Provides performance benchmarks for various embedding models on CPU and GPU, alongside latency metrics.
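
The two chunking strategies mentioned above can be sketched roughly as follows. The function names, the word-based window, and the default values are assumptions chosen for illustration; the package's own parameters and behaviour may differ.

```python
def sliding_window_chunks(text, window_size=240, overlap=8):
    """Split text into word windows; consecutive windows share `overlap` words."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    step = window_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once a window has reached the end of the text.
        if start + window_size >= len(words):
            break
    return chunks


def paragraph_chunks(text):
    """Split text on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some words twice.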

Maintenance & Community

The project is associated with Kagi Search. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

License: MIT License.
Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not state a maximum data size or describe performance on very large datasets, beyond noting the switch between Faiss and mrpt. The project appears to be in use within Kagi Search, although the repository's last commit was about a year ago.

Health Check

Last commit: 1 year ago
Responsiveness: Inactive

Star History

6 stars in the last 30 days