vectorflow  by dgarnitz

Vector embedding pipeline for high-volume data ingestion

created 2 years ago
697 stars

Top 49.8% on sourcepulse

Project Summary

VectorFlow is an open-source, high-throughput, fault-tolerant pipeline that ingests raw data, transforms it into vector embeddings, and stores them in a vector database. It targets developers and researchers who need to process large volumes of text for applications such as semantic search or recommendation systems.

How It Works

VectorFlow processes data through a pipeline involving chunking, embedding generation, and storage. It supports various file types (TXT, PDF, DOCX, HTML) and offers configurable chunking strategies (token-based, with customizability via Python scripts). Embeddings can be generated using OpenAI models or HuggingFace Sentence Transformers. The pipeline is designed for high throughput and fault tolerance, leveraging message queues and containerization for reliability.
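
To make the token-based chunking step concrete, here is a minimal sketch in Python. It is not VectorFlow's implementation: the function name `chunk_tokens`, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration. A real pipeline would more likely count tokens with the embedding model's own tokenizer.

```python
def chunk_tokens(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of whitespace-delimited tokens.

    Hypothetical helper for illustration only; VectorFlow's own chunker
    may tokenize and window the text differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    tokens = text.split()          # assumption: one word == one token
    step = chunk_size - overlap    # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the window has reached the end of the token list,
        # so no trailing chunk consists purely of overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_tokens("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, which helps keep sentences that straddle a boundary retrievable from either chunk. Each chunk would then be passed to the embedding model and the resulting vector written to the vector database.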

Quick Start & Requirements

  • Local Setup: git clone https://github.com/dgarnitz/vectorflow.git && cd vectorflow && ./setup.sh
  • Client Library: pip install vectorflow-client
  • Prerequisites: Docker, Docker Compose, Python 3.x. For a local deployment, pull images for RabbitMQ, PostgreSQL, MinIO, and a chosen vector DB (Qdrant v1.9.1 recommended). Requires API keys for embedding models (e.g., OpenAI) and vector databases.
  • Resources: Docker Compose setup involves pulling multiple images and configuring environment variables.
  • Docs: https://github.com/dgarnitz/vectorflow#readme

Highlighted Details

  • Supports multiple vector databases: Pinecone, Qdrant, Weaviate.
  • Offers webhook integrations for raw embeddings and chunk validation.
  • Includes a Python client library for easy integration into applications.
  • Provides an S3 endpoint for processing pre-signed URLs.

Maintenance & Community

  • The roadmap lists future features such as multi-file ingestion and LangChain integration, but the last commit was about a year ago and the project is currently marked inactive.
  • Community engagement encouraged via Discord.

Licensing & Compatibility

  • The project does not explicitly state a license in the provided README. This requires clarification for commercial use or closed-source linking.

Limitations & Caveats

  • The current version is an MVP.
  • The /embed endpoint has a 25MB file size limit and may be deprecated.
  • Not recommended for use on Windows.
  • Standard metadata schema for vector stores is enforced but planned for dynamic configuration.
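
Because of the 25MB limit on the /embed endpoint, it can be worth checking a file's size before uploading it. The sketch below shows one way to do that in Python. The helper names are hypothetical, and reading "25MB" as 25 × 1024 × 1024 bytes is an assumption; the server may count the limit differently.

```python
import os

# Assumption: "25MB" is interpreted as 25 MiB. Adjust if the server
# enforces the limit as 25 * 1000 * 1000 bytes instead.
EMBED_LIMIT_BYTES = 25 * 1024 * 1024


def within_limit(size_bytes, limit_bytes=EMBED_LIMIT_BYTES):
    """Return True if a payload of size_bytes fits under the limit."""
    return size_bytes <= limit_bytes


def fits_embed_limit(path, limit_bytes=EMBED_LIMIT_BYTES):
    """Return True if the file at path is small enough to upload.

    Hypothetical pre-flight check; it does not contact VectorFlow.
    """
    return within_limit(os.path.getsize(path), limit_bytes)
```

A caller could use `fits_embed_limit(path)` to decide between a direct upload and an alternative route for larger files, such as the S3 pre-signed URL endpoint listed under Highlighted Details.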

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
