vector-io  by AI-Northstar-Tech

Universal vector dataset tooling

created 2 years ago
251 stars

Top 99.8% on sourcepulse

Project Summary

vector-io provides a common interface for vector datasets: it exports, imports, and re-embeds data across vector databases and RAG platforms. It targets developers and researchers working with large-scale vector data, and defines a standardized format (VDF) that hides database-specific details to simplify data migration and embedding-model experimentation.

How It Works

The core of vector-io is the Universal Vector Dataset Format (VDF), a standardized layout consisting of a VDF_META.json file and associated Parquet files. Because the format is not tied to any particular vector database, the same dataset can be moved between stores. The library provides CLI tools (export_vdf, import_vdf, reembed_vdf) that use this format to translate data between vector stores and to regenerate embeddings with a specified model.
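To make the layout concrete, here is a minimal sketch of inspecting a VDF-style directory using only the Python standard library. It is not the library's own API: the function name describe_vdf_dir is hypothetical, the metadata keys are borrowed from the fields this summary mentions (model_name, dimensions, metric), and the real VDF_META.json schema and directory structure may differ.

```python
import json
import tempfile
from pathlib import Path


def describe_vdf_dir(root):
    """Return the parsed metadata and the sorted Parquet file paths under root.

    Hypothetical helper: assumes VDF_META.json sits at the top of the directory.
    """
    root = Path(root)
    meta = json.loads((root / "VDF_META.json").read_text(encoding="utf-8"))
    parquet_files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.parquet")
    )
    return meta, parquet_files


# Build a throwaway directory to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Illustrative metadata only; the real schema may contain other fields.
    example_meta = {"model_name": "example-model", "dimensions": 3, "metric": "cosine"}
    (root / "VDF_META.json").write_text(json.dumps(example_meta), encoding="utf-8")
    # Empty placeholder standing in for a real Parquet file.
    (root / "part-0.parquet").touch()

    meta, files = describe_vdf_dir(root)
    print(meta["dimensions"], files)
```

Reading the Parquet files themselves would need a library such as pyarrow or pandas; the sketch only lists them to stay dependency-free.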

Quick Start & Requirements

  • Primary install: pip install vdf-io
  • Prerequisites: Python 3.x. Specific vector database clients may require additional setup.
  • Links: Contributing Guide, Examples

Highlighted Details

  • Supports import/export for Pinecone, Qdrant, Milvus, GCP Vertex AI Vector Search, KDB.AI, and LanceDB.
  • Offers a reembed_vdf utility to change embedding models without altering the vector store.
  • VDF specification includes metadata like model_name, dimensions, and metric for comprehensive dataset description.
  • Extensible design allows community contributions for new vector database integrations.
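
The metadata fields listed above lend themselves to a simple consistency check before import. The following is a sketch under stated assumptions, not code from vector-io: the validate function and the flat dictionary shape are hypothetical, and only the field names model_name, dimensions, and metric come from this summary.

```python
# Field names taken from the summary; the flat layout is an assumption.
REQUIRED_FIELDS = ("model_name", "dimensions", "metric")


def validate(meta, vectors):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [
        f"missing metadata field: {name}"
        for name in REQUIRED_FIELDS
        if name not in meta
    ]
    dims = meta.get("dimensions")
    if isinstance(dims, int):
        # Every vector should have exactly the declared number of dimensions.
        for i, vec in enumerate(vectors):
            if len(vec) != dims:
                problems.append(
                    f"vector {i} has {len(vec)} dimensions, expected {dims}"
                )
    return problems


meta = {"model_name": "example-model", "dimensions": 3, "metric": "cosine"}
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5]]
problems = validate(meta, vectors)
print(problems)
```

A check like this catches a mismatch between declared and actual dimensions early, which is the kind of error that tends to surface only at insert time in the target database.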

Maintenance & Community

  • Key contributors include Dhruv Anand and Jayesh Rathi.
  • Community interaction via GitHub Issues.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial or closed-source use is undetermined.

Limitations & Caveats

  • Import/export functionality is not yet implemented for many popular vector databases, including Weaviate, MongoDB Atlas, and Elasticsearch.
  • Telemetry is enabled by default and sends anonymous usage data, though it can be disabled via an environment variable.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
