vector-io by AI-Northstar-Tech

Universal vector dataset tooling

Created 2 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

This library provides a universal interface for vector datasets, enabling seamless export, import, and re-embedding across various vector databases and RAG platforms. It targets developers and researchers working with large-scale vector data, offering a standardized format (VDF) to abstract away database-specific complexities and facilitate data migration and model experimentation.

How It Works

The core of vector-io is the Universal Vector Dataset Format (VDF), a standardized structure comprising a VDF_META.json file and associated Parquet files. This format decouples data from specific vector databases, allowing for agnostic operations. The library provides CLI tools (export_vdf, import_vdf, reembed_vdf) that leverage this format to translate data between different vector stores and to re-generate embeddings using specified models.
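As a rough illustration of the layout described above, the following sketch writes a VDF_META.json file into a dataset directory and lists the Parquet files next to it. The field names model_name, dimensions, and metric come from the Highlighted Details section; their values, the directory name, the file name part-0.parquet, and both helper functions are hypothetical, and the real VDF schema likely contains more fields than shown here. The Parquet file is an empty placeholder, since writing real Parquet data would need a third-party library.

```python
import json
import tempfile
from pathlib import Path


def write_vdf_skeleton(root: Path, meta: dict) -> Path:
    """Create a dataset directory holding a VDF_META.json file (hypothetical helper)."""
    root.mkdir(parents=True, exist_ok=True)
    meta_path = root / "VDF_META.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path


def list_parquet_files(root: Path) -> list[str]:
    """Return the Parquet files stored under the dataset directory, as relative paths."""
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.parquet"))


# Assumed metadata: only the three key names are taken from the summary;
# the values are placeholders, not defaults of the library.
example_meta = {
    "model_name": "example-embedding-model",
    "dimensions": 384,
    "metric": "cosine",
}

with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "my_dataset"
    meta_file = write_vdf_skeleton(dataset_dir, example_meta)
    # Empty placeholder standing in for a real Parquet file of vectors.
    (dataset_dir / "part-0.parquet").touch()
    loaded = json.loads(meta_file.read_text(encoding="utf-8"))
    parquet_files = list_parquet_files(dataset_dir)
    print(loaded)
    print(parquet_files)
```

Keeping the metadata in a plain JSON file beside the data files is what lets a dataset be inspected or moved without connecting to any particular vector database.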

Quick Start & Requirements

  • Primary install: pip install vdf-io
  • Prerequisites: Python 3.x. Specific vector database clients may require additional setup.
  • Links: Contributing Guide, Examples
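After installation, a quick sanity check is to see whether the three CLI tools named under How It Works can be found on the PATH. The sketch below assumes that pip install vdf-io registers export_vdf, import_vdf, and reembed_vdf as console commands; it does not run them or assume any of their flags, and find_vdf_commands is a hypothetical helper name.

```python
import shutil

# Command names as listed in the summary; their presence on PATH is an assumption.
VDF_COMMANDS = ("export_vdf", "import_vdf", "reembed_vdf")


def find_vdf_commands(commands=VDF_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


status = find_vdf_commands()
for name, path in status.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as not found, the package may be installed in a different environment than the one running the check.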

Highlighted Details

  • Supports import/export for Pinecone, Qdrant, Milvus, GCP Vertex AI Vector Search, KDB.AI, and LanceDB.
  • Offers a reembed_vdf utility to change embedding models without altering the vector store.
  • VDF specification includes metadata like model_name, dimensions, and metric for comprehensive dataset description.
  • Extensible design allows community contributions for new vector database integrations.
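The idea behind re-embedding can be sketched independently of the library: keep each record's ID and source text, recompute its vector with a different model, and update the dataset metadata to match. This is a conceptual sketch only and not how reembed_vdf is implemented; the row layout (id, text, vector), the reembed function, and the toy_embed stand-in model are all hypothetical.

```python
import hashlib
from typing import Callable


def toy_embed(text: str, dimensions: int = 4) -> list[float]:
    """Deterministic stand-in for an embedding model (not semantically meaningful)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dimensions)]


def reembed(
    rows: list[dict],
    meta: dict,
    embed: Callable[[str], list[float]],
    model_name: str,
    dimensions: int,
) -> tuple[list[dict], dict]:
    """Return new rows and metadata with vectors recomputed by `embed`."""
    new_rows = []
    for row in rows:
        vector = embed(row["text"])
        if len(vector) != dimensions:
            raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
        # Copy the row so the original dataset is left untouched.
        new_rows.append({**row, "vector": vector})
    new_meta = {**meta, "model_name": model_name, "dimensions": dimensions}
    return new_rows, new_meta


# Assumed example data: two records embedded with an older 2-dimensional model.
old_meta = {"model_name": "old-model", "dimensions": 2, "metric": "cosine"}
old_rows = [
    {"id": "a", "text": "hello world", "vector": [0.1, 0.2]},
    {"id": "b", "text": "vector search", "vector": [0.3, 0.4]},
]

new_rows, new_meta = reembed(old_rows, old_meta, toy_embed, "toy-sha256", 4)
print(new_meta)
```

Because the metric is carried over unchanged, a real workflow would also need to confirm that the new model is meant to be used with the same distance metric.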

Maintenance & Community

  • Key contributors include Dhruv Anand and Jayesh Rathi.
  • Community interaction via GitHub Issues.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

  • Import/export functionality is not yet implemented for many popular vector databases, including Weaviate, MongoDB Atlas, and Elasticsearch.
  • Telemetry is enabled by default and sends anonymous usage data, though it can be disabled via an environment variable.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
