vectorflow by dgarnitz

Vector embedding pipeline for high-volume data ingestion

Created 2 years ago

698 stars

Top 49.0% on SourcePulse

3 Experts Love This Project

jerryjliu

Cofounder of LlamaIndex

vincentweisser

Vincent Weisser

Cofounder of Prime Intellect

marcklingen

Cofounder of Langfuse

Project Summary

VectorFlow is an open-source, high-throughput, fault-tolerant pipeline that ingests raw data, transforms it into vector embeddings, and stores them in a vector database. It targets developers and researchers who need to process large volumes of text for applications such as semantic search or recommendation systems.

How It Works

VectorFlow processes data in three stages: chunking, embedding generation, and storage. It accepts file types such as TXT, PDF, DOCX, and HTML. Chunking is configurable: a token-based strategy is available, and custom strategies can be supplied as Python scripts. Embeddings are generated with OpenAI models or HuggingFace Sentence Transformers. For throughput and fault tolerance, the pipeline relies on message queues and containerization.
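To make the token-based chunking stage concrete, here is a minimal sketch of overlapping fixed-size chunks. It is not VectorFlow's implementation: the function name `chunk_tokens`, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration. A real pipeline would likely count tokens with the embedding model's tokenizer.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into overlapping chunks of at most `chunk_size` tokens.

    Assumption: a "token" is a whitespace-separated word, used here as a
    stand-in for a real model tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk shares its first `overlap` tokens with the end of the previous chunk, which helps keep sentences that straddle a boundary retrievable from at least one chunk.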

Quick Start & Requirements

Local Setup: git clone https://github.com/dgarnitz/vectorflow.git && cd vectorflow && ./setup.sh
Client Library: pip install vectorflow-client
Prerequisites: Docker, Docker Compose, and Python 3.x. For a local setup, pull images for RabbitMQ, PostgreSQL, MinIO, and a vector DB of your choice (Qdrant v1.9.1 recommended). API keys are required for embedding models (e.g., OpenAI) and vector databases.
Resources: Docker Compose setup involves pulling multiple images and configuring environment variables.
Docs: https://github.com/dgarnitz/vectorflow#readme
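Before running the setup script, it can help to confirm that the command-line prerequisites are on the PATH. The snippet below is a small pre-flight sketch, not part of VectorFlow; the helper name `missing_tools` and the default tool list are assumptions. It does not check for the Docker Compose plugin, which may be installed either as `docker compose` or as a separate `docker-compose` binary.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("git", "docker")) -> List[str]:
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All checked prerequisites were found.")
```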

Highlighted Details

Supports multiple vector databases: Pinecone, Qdrant, Weaviate.
Offers webhook integrations for raw embeddings and chunk validation.
Includes a Python client library for easy integration into applications.
Provides an S3 endpoint for processing pre-signed URLs.

Maintenance & Community

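The README outlines a roadmap with planned features such as multi-file ingestion and LangChain integration. Note that the last commit was about a year ago (see Health Check), so development may no longer be active.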
Community engagement encouraged via Discord.

Licensing & Compatibility

The README summarized here does not explicitly state a license. Check the repository for a license file before commercial use or closed-source linking.

Limitations & Caveats

The current version is an MVP.
The /embed endpoint has a 25 MB file size limit and may be deprecated.
Not recommended for use on Windows.
A fixed metadata schema is enforced for vector stores; dynamic configuration is planned.
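Given the 25 MB limit on the /embed endpoint, a client can check file size locally before attempting an upload. This is a hedged sketch: the helper name `fits_embed_limit` is hypothetical, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption, since the source does not say whether the limit is decimal or binary.

```python
import os

# Assumption: "25 MB" is interpreted as 25 MiB; the source does not specify.
EMBED_LIMIT_BYTES = 25 * 1024 * 1024


def fits_embed_limit(path: str, limit_bytes: int = EMBED_LIMIT_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes
```

Files that fail this check would need another ingestion route, for example the S3 pre-signed URL endpoint listed under Highlighted Details.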

Health Check

Last Commit: 1 year ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

3 stars in the last 30 days
