Vector embedding pipeline for high-volume data ingestion
VectorFlow is an open-source, high-throughput, fault-tolerant pipeline that ingests raw data, transforms it into vector embeddings, and stores them in a vector database. It targets developers and researchers who need to process large volumes of text for applications such as semantic search or recommendation systems.
How It Works
VectorFlow processes data in three stages: chunking, embedding generation, and storage. It supports several file types (TXT, PDF, DOCX, HTML) and offers configurable chunking strategies: token-based chunking, or custom strategies supplied as Python scripts. Embeddings can be generated with OpenAI models or HuggingFace Sentence Transformers. For throughput and fault tolerance, the pipeline relies on message queues and containerization.
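The chunk, embed, and store flow described above can be sketched in a few lines of Python. This is an illustration only, not VectorFlow's actual code: the names chunk_tokens, toy_embed, and run_pipeline are hypothetical, whitespace splitting stands in for a real tokenizer, an in-process queue.Queue stands in for a message broker, and toy_embed stands in for an OpenAI or Sentence Transformers model.

```python
import queue
from typing import Callable, List, Sequence


def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(chunk: str) -> List[float]:
    """Hypothetical placeholder embedder: returns [character count, word count]."""
    return [float(len(chunk)), float(len(chunk.split()))]


def run_pipeline(
    texts: Sequence[str],
    embed: Callable[[str], List[float]],
    store: Callable[[str, List[float]], None],
    chunk_size: int = 256,
    overlap: int = 32,
    max_retries: int = 3,
) -> List[str]:
    """Chunk each text, then embed and store every chunk via a work queue.

    A chunk whose embed or store step raises is re-queued until it has been
    tried max_retries times; chunks that still fail are returned.
    """
    work: "queue.Queue" = queue.Queue()
    for text in texts:
        for chunk in chunk_tokens(text, chunk_size, overlap):
            work.put((chunk, 0))

    failed = []
    while not work.empty():
        chunk, attempts = work.get()
        try:
            store(chunk, embed(chunk))
        except Exception:
            if attempts + 1 < max_retries:
                work.put((chunk, attempts + 1))  # retry later
            else:
                failed.append(chunk)
    return failed


if __name__ == "__main__":
    stored = []
    leftover = run_pipeline(
        ["VectorFlow turns raw text into vector embeddings"],
        toy_embed,
        lambda chunk, vector: stored.append((chunk, vector)),
        chunk_size=4,
        overlap=1,
    )
    print(len(stored), "chunks stored;", len(leftover), "failed")
```

Re-queuing a failed chunk rather than aborting the whole run is the simplest form of the fault tolerance the description mentions. A production deployment would use a durable broker so that queued work survives a worker crash.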
Quick Start & Requirements
git clone https://github.com/dgarnitz/vectorflow.git && cd vectorflow && ./setup.sh
pip install vectorflow-client
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The /embed endpoint has a 25MB file size limit and may be deprecated.
1 year ago
Inactive