jvector by datastax

Embedded vector search engine

Created 2 years ago

1,669 stars

Top 25.1% on SourcePulse

1 Expert Loves This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Project Summary

JVector is an embedded vector search engine for developers and researchers who need efficient approximate nearest neighbor (ANN) search. Its graph-based index combines techniques from DiskANN and HNSW to provide fast, accurate, and scalable vector similarity search, particularly for high-dimensional data.

How It Works

JVector implements a multi-layer graph index: the hierarchy of layers is inspired by HNSW, and each layer is built with Vamana, the graph algorithm from DiskANN. Index construction uses non-blocking concurrency so that it scales. The upper layers are held as in-memory adjacency lists, while the bottom layer is stored as an on-disk adjacency list. Search runs in two passes: a first pass that can score candidates against compressed vectors (Product Quantization, Binary Quantization, or Fused ADC), followed by an optional reranking pass. This reduces memory usage and latency while preserving accuracy. The same two-pass search is used during construction, which is what allows JVector to build indexes larger than memory.
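The two-pass idea described above can be illustrated with a minimal, self-contained sketch. It does not use JVector's API: the class and method names (TwoPassSearchSketch, quantize, search) are hypothetical, the first pass is a brute-force scan rather than a graph traversal, and binary quantization is reduced to one sign bit per dimension. The point is only to show a cheap pass over compressed codes selecting candidates that a second pass reranks with exact scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: coarse Hamming scan over sign bits, then exact rerank. Not JVector code. */
final class TwoPassSearchSketch {

    /** Simplified binary quantization: one bit per dimension, set when the component is positive. */
    static long[] quantize(float[] v) {
        long[] bits = new long[(v.length + 63) / 64];
        for (int i = 0; i < v.length; i++) {
            if (v[i] > 0f) {
                bits[i / 64] |= 1L << (i % 64);
            }
        }
        return bits;
    }

    /** Number of differing bits between two codes of equal length. */
    static int hamming(long[] a, long[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            d += Long.bitCount(a[i] ^ b[i]);
        }
        return d;
    }

    /** Exact dot product on the full-precision vectors. */
    static float dot(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    /**
     * Pass 1 keeps the `overquery` ids whose codes are closest in Hamming distance.
     * Pass 2 reranks those ids by exact dot product and returns the best `topK`.
     */
    static List<Integer> search(float[][] vectors, long[][] codes, float[] query,
                                int topK, int overquery) {
        long[] queryCode = quantize(query);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            ids.add(i);
        }
        // Pass 1: approximate ordering using only the compressed codes.
        ids.sort(Comparator.comparingInt(i -> hamming(queryCode, codes[i])));
        List<Integer> candidates = new ArrayList<>(ids.subList(0, Math.min(overquery, ids.size())));
        // Pass 2: exact rerank, highest dot product first.
        candidates.sort(Comparator.comparingDouble((Integer i) -> -dot(query, vectors[i])));
        return new ArrayList<>(candidates.subList(0, Math.min(topK, candidates.size())));
    }
}
```

In this sketch, a larger overquery value trades extra exact comparisons for a better chance that the true nearest neighbors survive the coarse first pass.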

Quick Start & Requirements

Install/Run: Primarily a Java library. Examples can be run via Maven: mvn compile exec:exec@bench or mvn compile exec:exec@sift.
Prerequisites: Java 11+ required. Java 20+ recommended for optimized vector providers (SIMD via Panama Vector API).
Resources: Benchmarks suggest that memory bandwidth can become saturated, so by default a PhysicalCoreExecutor limits concurrent operations to the physical core count. The count can be overridden with -Djvector.physical_core_count.
Links: Examples, SiftSmall, Bench
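As a rough illustration of capping parallelism at the physical core count, the sketch below resolves a core count from the jvector.physical_core_count system property and sizes a ForkJoinPool with it. This is not JVector's PhysicalCoreExecutor: the class name PhysicalCorePoolSketch is hypothetical, and the fallback of half the logical processors is an assumption (two hardware threads per core) that will not hold on every machine.

```java
import java.util.concurrent.ForkJoinPool;

/** Illustrative only: sizes a pool from a system property. Not JVector's PhysicalCoreExecutor. */
final class PhysicalCorePoolSketch {

    /**
     * Uses the configured value when it is a positive integer; otherwise falls back to
     * half the logical processors (assumption: two hardware threads per physical core).
     */
    static int resolveCoreCount(String configured, int logicalProcessors) {
        if (configured != null && !configured.isBlank()) {
            try {
                int n = Integer.parseInt(configured.trim());
                if (n > 0) {
                    return n;
                }
            } catch (NumberFormatException ignored) {
                // Fall through to the heuristic below.
            }
        }
        return Math.max(1, logicalProcessors / 2);
    }

    /** Builds a pool whose parallelism matches the resolved core count. */
    static ForkJoinPool newPool() {
        int cores = resolveCoreCount(
                System.getProperty("jvector.physical_core_count"),
                Runtime.getRuntime().availableProcessors());
        return new ForkJoinPool(cores);
    }
}
```

With a pool like this, starting the JVM with -Djvector.physical_core_count=8 would yield a parallelism of 8 regardless of how many logical processors the machine reports.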

Highlighted Details

Merges HNSW and DiskANN (Vamana) for a hybrid graph index.
Supports incremental index construction and in-place deletes.
Offers Product Quantization (PQ), Binary Quantization (BQ), and Fused ADC for compression.
Two-pass search with optional reranking for accuracy and performance.
Capable of building larger-than-memory indexes.
Leverages Java's Panama Vector API (SIMD) for performance.
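Both HNSW and Vamana answer queries with a greedy best-first walk over an adjacency list, and a minimal single-layer version of that walk is sketched here. It is not JVector's implementation: the names (GraphSearchSketch, search, beamWidth) are hypothetical, the graph is a plain int[][] held in memory, and there is no layering, compression, or reranking.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Illustrative only: single-layer greedy beam search over an adjacency list. Not JVector code. */
final class GraphSearchSketch {

    static float squaredDistance(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Returns up to beamWidth node ids reachable from `entry`, closest to the query first. */
    static List<Integer> search(float[][] vectors, int[][] neighbors, int entry,
                                float[] query, int beamWidth) {
        Comparator<Integer> byDistance =
                Comparator.comparingDouble(i -> squaredDistance(query, vectors[i]));
        PriorityQueue<Integer> frontier = new PriorityQueue<>(byDistance);        // closest first
        PriorityQueue<Integer> best = new PriorityQueue<>(byDistance.reversed()); // farthest first
        Set<Integer> visited = new HashSet<>();
        frontier.add(entry);
        best.add(entry);
        visited.add(entry);

        while (!frontier.isEmpty()) {
            int current = frontier.poll();
            // Stop once the closest unexplored node is worse than the worst kept result.
            if (best.size() >= beamWidth
                    && squaredDistance(query, vectors[current])
                       > squaredDistance(query, vectors[best.peek()])) {
                break;
            }
            for (int next : neighbors[current]) {
                if (!visited.add(next)) {
                    continue; // already seen
                }
                if (best.size() < beamWidth
                        || squaredDistance(query, vectors[next])
                           < squaredDistance(query, vectors[best.peek()])) {
                    frontier.add(next);
                    best.add(next);
                    if (best.size() > beamWidth) {
                        best.poll(); // evict the farthest result
                    }
                }
            }
        }
        List<Integer> result = new ArrayList<>(best);
        result.sort(byDistance);
        return result;
    }
}
```

A wider beam explores more of the graph per query, which generally improves recall at the cost of more distance computations.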

Maintenance & Community

Developed by DataStax.
Multi-module Maven build, targeting Java 11 compatibility with Java 20+ optimizations.
Community and support channels are not explicitly mentioned in the README.

Licensing & Compatibility

Apache License 2.0.
Compatible with commercial and closed-source applications.

Limitations & Caveats

Anisotropic PQ tuning is experimental and can degrade performance if misconfigured.
SimpleMappedReader for on-disk indexes is limited to 2GB file sizes; MemorySegmentReader requires Java 22+.
The README notes that indexing and PQ can saturate memory bandwidth; PhysicalCoreExecutor mitigates this by limiting work to the physical core count.
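To get a feel for what the 2GB SimpleMappedReader limit means in practice, the following back-of-the-envelope sketch estimates how many nodes fit in a single mapped file. The limit is taken as Integer.MAX_VALUE bytes, the largest region Java's FileChannel.map accepts. The per-node layout (one 4-byte float per dimension plus one 4-byte int per neighbor) is an assumption made for illustration, not JVector's actual on-disk format, and the names are hypothetical.

```java
/** Illustrative only: rough capacity estimate under an assumed per-node layout. */
final class MappedSizeSketch {

    /** Largest region a single FileChannel.map call accepts: 2^31 - 1 bytes. */
    static final long MAPPED_LIMIT_BYTES = Integer.MAX_VALUE;

    /** Assumed layout: full-precision vector plus a fixed-size list of int neighbor ids. */
    static long bytesPerNode(int dimension, int maxDegree) {
        return (long) dimension * Float.BYTES + (long) maxDegree * Integer.BYTES;
    }

    /** Whole nodes that fit under the mapping limit, ignoring headers and padding. */
    static long maxNodes(int dimension, int maxDegree) {
        return MAPPED_LIMIT_BYTES / bytesPerNode(dimension, maxDegree);
    }
}
```

Under these assumptions, 768-dimensional vectors with 32 neighbors each take 3,200 bytes per node, so roughly 671,000 nodes fit before the limit is reached.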

Health Check

Last Commit

3 days ago

Responsiveness

1 day

Star History

11 stars in the last 30 days