tinyvector  by 0hq

Embedding database using SQLite and PyTorch

created 2 years ago
772 stars

Top 46.1% on sourcepulse

Project Summary

This project provides a lightweight, embeddable vector database designed for small to medium datasets, targeting developers who find traditional vector databases overly complex for common use cases like document search or website product discovery. It aims to offer comparable speed to advanced solutions with a significantly simpler architecture and MIT licensing.

How It Works

Tinyvector uses a minimalist architecture: a Flask server, an SQLite database for storage, and NumPy for indexing. Indexes are held in memory for fast querying, so capacity grows by scaling vertically (adding memory to a single machine), which the project says allows it to handle millions of vector dimensions. The small codebase is intended to make customization easy.
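The storage-plus-index split described above can be sketched in a few lines. This is a minimal illustration, not tinyvector's actual code or API: the table name `items`, the helpers `to_blob`, `from_blob`, `build_index`, and `query`, and the choice of cosine similarity are all assumptions made for the example, and the Flask layer is omitted.

```python
import sqlite3

import numpy as np


def to_blob(vec) -> bytes:
    """Serialize a vector as raw float32 bytes (hypothetical helper)."""
    return np.asarray(vec, dtype=np.float32).tobytes()


def from_blob(blob: bytes) -> np.ndarray:
    """Deserialize raw float32 bytes back into a vector (hypothetical helper)."""
    return np.frombuffer(blob, dtype=np.float32)


# SQLite is the durable store; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")
rows = [
    ("apple", [1.0, 0.0, 0.0]),
    ("banana", [0.9, 0.1, 0.0]),
    ("car", [0.0, 0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO items (text, embedding) VALUES (?, ?)",
    [(text, to_blob(vec)) for text, vec in rows],
)


def build_index(conn):
    """Load every stored vector into one row-normalized NumPy matrix."""
    texts, vecs = [], []
    for text, blob in conn.execute("SELECT text, embedding FROM items ORDER BY id"):
        texts.append(text)
        vecs.append(from_blob(blob))
    matrix = np.vstack(vecs)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return texts, matrix / norms  # assumes no all-zero vectors


def query(texts, index, vec, k=2):
    """Brute-force cosine similarity: one matrix-vector product, then top-k."""
    q = np.asarray(vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [(texts[i], float(scores[i])) for i in top]


texts, index = build_index(conn)
results = query(texts, index, [1.0, 0.0, 0.0])
```

Because the whole index is a single in-memory matrix, each query costs one matrix-vector product, which is why this kind of design scales with the memory of one machine rather than with additional nodes.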

Quick Start & Requirements

  • Install: pip install -r requirements
  • Run: python -m server
  • Testing: install the test dependencies with pip install pytest pytest-mock, then run pytest
  • Prerequisites: Python, Flask, NumPy, and PyTorch (PyTorch is for GPU acceleration; it is not explicitly required for basic CPU operation).

Highlighted Details

  • Minimalist design with under 500 lines of Python code.
  • Planned integration of full SQL querying capabilities.
  • Future support for automatic embedding generation using models from Hugging Face, OpenAI, and Cohere.
  • Aims for comparable speed to advanced vector databases on small to medium datasets.

Maintenance & Community

The project describes itself as under active development, with a stated goal of being production-ready by late July (no year is given). Contributions are encouraged, and the maintainers list specific ideas for improvement, such as adding metadata filtering and GPU acceleration. Contact: @willdepue.

Licensing & Compatibility

MIT Licensed, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The project is explicitly marked as "in development" and "not ready." Known major bugs include potential data corruption where stored vectors change, possibly due to blob or norm functions. PCA and brute-force indexing are not yet tested. Metadata filtering is not currently supported but is a planned feature.
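The source does not identify the cause of the vector-change bug beyond pointing at the blob or norm functions. As a hedged illustration only, the sketch below shows the kind of round-trip check that can help narrow down such a problem: it verifies that a vector survives SQLite blob storage unchanged, that out-of-place normalization leaves the stored values alone, and that reading a blob back with the wrong dtype is detectable. The helper `store_and_load` and the table `v` are hypothetical and not part of tinyvector.

```python
import sqlite3

import numpy as np


def store_and_load(vec: np.ndarray) -> np.ndarray:
    """Write a vector to SQLite as a blob and read it back (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE v (id INTEGER PRIMARY KEY, embedding BLOB)")
    conn.execute("INSERT INTO v (embedding) VALUES (?)", (vec.tobytes(),))
    (blob,) = conn.execute("SELECT embedding FROM v").fetchone()
    conn.close()
    # np.frombuffer returns a read-only view; copy so callers own their data.
    return np.frombuffer(blob, dtype=vec.dtype).copy()


rng = np.random.default_rng(0)
original = rng.standard_normal(8).astype(np.float32)
snapshot = original.copy()

# Check 1: the blob round trip preserves every value exactly.
loaded = store_and_load(original)
roundtrip_ok = np.array_equal(loaded, original)

# Check 2: out-of-place normalization must not alter the loaded vector.
unit = loaded / np.linalg.norm(loaded)
unchanged_after_norm = np.array_equal(loaded, snapshot)

# Check 3: decoding with a different dtype than was written changes the shape
# (8 float32 values are 32 bytes, which decode as only 4 float64 values).
wrong = np.frombuffer(original.tobytes(), dtype=np.float64)
dtype_mismatch_detected = wrong.shape != original.shape
```

An in-place variant of the normalization step (for example `loaded /= norm` on an array shared with the index) would fail the second check, which is one plausible way stored vectors could appear to change; this is an assumption for illustration, not a diagnosis of the project's bug.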

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
4 stars in the last 90 days
