tinyvector by 0hq

Embedding database using SQLite and PyTorch

Created 2 years ago

771 stars

Top 45.4% on SourcePulse


3 Experts Love This Project

Philipp Schmid

DevRel at Google DeepMind

Project Summary

This project provides a lightweight, embeddable vector database designed for small to medium datasets, targeting developers who find traditional vector databases overly complex for common use cases like document search or website product discovery. It aims to offer comparable speed to advanced solutions with a significantly simpler architecture and MIT licensing.

How It Works

Tinyvector uses a minimalist architecture: a Flask server, an SQLite database for storage, and NumPy arrays for indexing. All indexes are held in memory for fast querying, so capacity grows by scaling the host vertically; the project claims this extends to millions of vector dimensions. The small codebase makes it easy to customize.
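To make the in-memory indexing idea concrete, here is a minimal sketch of a brute-force cosine-similarity index built on a NumPy matrix. It is an illustration under assumptions, not tinyvector's actual code: the class name InMemoryIndex and its add and query methods are hypothetical, and the real project may organize its indexes differently.

```python
import numpy as np


class InMemoryIndex:
    """Hypothetical brute-force cosine-similarity index (not tinyvector's API)."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        # One row per stored vector, kept unit-length so a dot product
        # equals cosine similarity.
        self.matrix = np.empty((0, dim), dtype=np.float32)

    def add(self, item_id, vector):
        v = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        norm = np.linalg.norm(v)
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        self.ids.append(item_id)
        self.matrix = np.vstack([self.matrix, v / norm])

    def query(self, vector, k=3):
        q = np.asarray(vector, dtype=np.float32).reshape(self.dim)
        norm = np.linalg.norm(q)
        if norm == 0:
            raise ValueError("cannot query with a zero vector")
        # A single matrix-vector product scores every stored vector at once.
        scores = self.matrix @ (q / norm)
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]


if __name__ == "__main__":
    index = InMemoryIndex(dim=3)
    index.add("a", [1.0, 0.0, 0.0])
    index.add("b", [0.0, 1.0, 0.0])
    index.add("c", [1.0, 1.0, 0.0])
    print(index.query([1.0, 0.1, 0.0], k=2))
```

Because every query scans the whole matrix, cost grows linearly with the number of stored vectors, which is why this style of index suits small to medium datasets that fit in memory.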

Quick Start & Requirements

Install: pip install -r requirements
Run: python -m server
Testing: pip install pytest pytest-mock, then run pytest
Prerequisites: Python, Flask, NumPy. PyTorch is used for GPU acceleration and does not appear to be required for basic CPU operation.

Highlighted Details

Minimalist design with under 500 lines of Python code.
Planned integration of full SQL querying capabilities.
Future support for automatic embedding generation using models from Hugging Face, OpenAI, and Cohere.
Aims for comparable speed to advanced vector databases on small to medium datasets.

Maintenance & Community

The README describes the project as under active development, with a stated goal of being production-ready by late July. Contributions are encouraged, and the README lists specific ideas for improvement, such as metadata filtering and GPU acceleration. Contact: @willdepue.

Licensing & Compatibility

MIT Licensed, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The project is explicitly marked as "in development" and "not ready." Known major bugs include possible data corruption in which stored vectors change, possibly caused by the blob or norm functions. PCA and brute-force indexing are untested. Metadata filtering is not yet supported but is planned.
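The cause of the reported vector corruption is not confirmed in the source. As a general illustration of how a blob round trip can be checked, the sketch below stores float32 vectors as SQLite BLOBs and verifies that they read back unchanged. The table layout and the helper names to_blob, from_blob, and roundtrip_check are hypothetical and are not taken from tinyvector. Two common pitfalls it guards against, assumed here rather than diagnosed in the project, are a dtype mismatch between writing and reading and in-place normalization of a buffer that is shared with the stored copy.

```python
import sqlite3

import numpy as np

DTYPE = np.float32  # assumption: one fixed dtype for both writing and reading


def to_blob(vector):
    """Serialize a vector to raw bytes using the fixed dtype."""
    return np.ascontiguousarray(vector, dtype=DTYPE).tobytes()


def from_blob(blob, dim):
    """Deserialize bytes to a vector, checking the expected dimension."""
    arr = np.frombuffer(blob, dtype=DTYPE)
    if arr.shape[0] != dim:
        raise ValueError(f"expected {dim} values, got {arr.shape[0]}")
    # frombuffer returns a read-only view; copy so callers can normalize
    # the result without touching the original bytes.
    return arr.copy()


def roundtrip_check(vectors):
    """Return True if every vector survives a write and read unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, dim INTEGER, vec BLOB)"
    )
    for i, v in enumerate(vectors):
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?)", (i, len(v), to_blob(v))
        )
    rows = conn.execute("SELECT id, dim, vec FROM items ORDER BY id").fetchall()
    conn.close()
    return all(
        np.array_equal(from_blob(blob, dim), np.asarray(vectors[i], dtype=DTYPE))
        for i, dim, blob in rows
    )


if __name__ == "__main__":
    print(roundtrip_check([[1.0, 2.0, 3.0], [0.5, -0.25, 4.0]]))
```

A check like this could run as a pytest case so that any change to serialization or normalization that alters stored vectors fails early.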

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days