duckdb-vss by duckdb

Vector similarity search extension for DuckDB

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This experimental DuckDB extension adds Vector Similarity Search (VSS) to DuckDB, enabling nearest-neighbor searches over vector data stored in fixed-size FLOAT array columns. It targets data scientists and engineers who want to run similarity queries inside their analytical workflows without operating a separate vector database, and it speeds up those queries through an index rather than a full scan.

How It Works

The extension integrates the usearch library to implement Hierarchical Navigable Small Worlds (HNSW) indexes. These indexes are exposed as a custom index type within DuckDB and work with its fixed-size ARRAY data type (introduced in v0.10.0). A query is accelerated via an HNSW_INDEX_SCAN operation when it orders by a distance function (array_distance, array_cosine_distance, or array_negative_inner_product) over an indexed FLOAT array column and includes a LIMIT clause.
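To make the accelerated query shape concrete, the sketch below computes in plain Python what an "order by distance, then LIMIT k" query returns when no index is available: an exact top-k by Euclidean distance. It is only a brute-force baseline for comparison, not the extension's implementation; an HNSW index answers the same kind of query approximately by walking a graph instead of scoring every row. The function name top_k_exact and the sample rows are assumptions made for illustration.

```python
import heapq
import math


def top_k_exact(rows, query, k):
    """Return the k (row_id, distance) pairs closest to `query`.

    `rows` is an iterable of (row_id, vector) pairs. Every row is scored,
    which is the full-scan cost that an index is meant to avoid.
    """
    scored = ((row_id, math.dist(vec, query)) for row_id, vec in rows)
    # nsmallest keeps only k candidates in memory, like ORDER BY ... LIMIT k.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data: (row_id, fixed-size vector of 3 floats).
rows = [
    (1, [0.0, 0.0, 0.0]),
    (2, [1.0, 2.0, 3.0]),
    (3, [1.0, 2.0, 4.0]),
    (4, [9.0, 9.0, 9.0]),
]

nearest = top_k_exact(rows, [1.0, 2.0, 3.0], k=2)
print(nearest)  # [(2, 0.0), (3, 1.0)]
```

The cost of this baseline grows linearly with the number of rows, which is why the LIMIT clause matters: it bounds the result to a small k that an index can retrieve without touching every vector.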

Quick Start & Requirements

  • Primary install/run command: Build from source using make. The primary executable is ./build/release/duckdb, which includes the extension. The loadable extension binary is ./build/release/extension/vss/vss.duckdb_extension.
  • Non-default prerequisites: DuckDB v0.10.0 or later and a C++ build environment.
  • Links: No external quick-start or documentation links are provided in the README.

Highlighted Details

  • Supports three distance metrics: Euclidean (l2sq), cosine (cosine), and inner product (ip).
  • Supports inserting, updating, and deleting rows after index creation.
  • Index data is persisted to disk with DuckDB's disk-backed databases, serialized on checkpoint and deserialized on restart (deferred until first access).
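The three metrics listed above can be written out as follows. This is a minimal sketch of the conventional textbook definitions, with function names chosen here for illustration; the extension delegates the actual computation to usearch, whose handling of edge cases (for example zero-length vectors) may differ.

```python
import math


def l2sq(a, b):
    """Squared Euclidean distance. It ranks neighbors in the same order as
    plain Euclidean distance because the square root is monotonic."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Negated dot product, so that a larger inner product sorts first
    when results are ordered ascending."""
    return -sum(x * y for x, y in zip(a, b))


a = [1.0, 0.0]
b = [0.0, 2.0]
c = [3.0, 0.0]

print(l2sq(a, b))                    # 5.0
print(cosine_distance(a, b))         # 1.0 (orthogonal vectors)
print(cosine_distance(a, c))         # 0.0 (same direction)
print(negative_inner_product(a, c))  # -3.0
```

Note that cosine distance ignores vector length (a and c are at distance 0.0), while the inner product rewards it, so the choice of metric changes which rows count as nearest.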

Maintenance & Community

No specific details regarding contributors, sponsorships, community channels (e.g., Discord/Slack), or roadmaps are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state the license type or provide compatibility notes for commercial use.

Limitations & Caveats

  • Only vectors of FLOAT values are currently supported.
  • The HNSW index must fit entirely in RAM.
  • Deletions are marked rather than applied immediately, which can degrade query quality and performance over time. Recovering requires manual re-compaction (PRAGMA hnsw_compact_index) or re-creating the index.
  • Serializing and deserializing the index during database checkpoints and restarts can be time-consuming for large indexes.
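The marked-deletion behavior described above can be illustrated with a toy structure. The class below is hypothetical and is not how the extension or usearch stores data; it only shows why entries that are marked deleted keep occupying space until a compaction step physically removes them.

```python
class TombstoneIndex:
    """Toy flat index with marked deletions (illustrative assumption only)."""

    def __init__(self):
        self._entries = {}     # row_id -> vector, including deleted rows
        self._deleted = set()  # row_ids marked as deleted (tombstones)

    def insert(self, row_id, vector):
        self._entries[row_id] = vector
        self._deleted.discard(row_id)

    def delete(self, row_id):
        # Mark only: the vector stays in storage until compact() runs.
        if row_id in self._entries:
            self._deleted.add(row_id)

    def live_ids(self):
        return sorted(r for r in self._entries if r not in self._deleted)

    def stored_count(self):
        return len(self._entries)

    def stale_fraction(self):
        """Share of stored entries that are tombstones."""
        if not self._entries:
            return 0.0
        return len(self._deleted) / len(self._entries)

    def compact(self):
        """Physically drop tombstoned entries, as a compaction pass would."""
        for row_id in self._deleted:
            del self._entries[row_id]
        self._deleted.clear()


index = TombstoneIndex()
for row_id in range(4):
    index.insert(row_id, [float(row_id)] * 3)
index.delete(0)
index.delete(1)

print(index.live_ids(), index.stored_count(), index.stale_fraction())
# [2, 3] 4 0.5
index.compact()
print(index.live_ids(), index.stored_count(), index.stale_fraction())
# [2, 3] 2 0.0
```

In a graph index the stale entries also take part in traversal, which is why the summary mentions an effect on query quality as well as on memory use.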

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
