lance by lancedb

Columnar data format for ML/LLMs, implemented in Rust

created 3 years ago
5,135 stars

Top 9.9% on sourcepulse

Project Summary

Lance is a modern columnar data format designed for machine learning workflows such as building search engines and feature stores and running large-scale ML training. It offers significant performance improvements over formats like Parquet for random access and integrates vector search capabilities directly into the data format.

How It Works

Lance is implemented in Rust and utilizes custom encodings and layouts to achieve fast columnar scans and sub-linear point queries. It stores nested fields as separate columns for efficient filtering. A key feature is its zero-copy, automatic versioning system, which allows for managing data snapshots without additional infrastructure. The format also supports rich secondary indices, including BTree, Bitmap, and full-text search.
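The versioning idea can be illustrated with a toy model. The sketch below is not Lance's on-disk layout or API; the names `Manifest` and `ToyDataset` are hypothetical. It only assumes the general pattern that each version is a small manifest listing immutable data fragments, so a new version reuses existing fragments instead of copying them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """One snapshot: a version number plus the fragments it references."""
    version: int
    fragments: tuple  # names of immutable data fragments


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a versioned dataset."""
    store: dict = field(default_factory=dict)      # fragment name -> rows
    manifests: list = field(default_factory=list)  # one manifest per version

    def append(self, rows):
        # New rows go into a new fragment; old fragments are never rewritten.
        name = f"frag-{len(self.store)}"
        self.store[name] = tuple(rows)
        previous = self.manifests[-1].fragments if self.manifests else ()
        manifest = Manifest(len(self.manifests) + 1, previous + (name,))
        self.manifests.append(manifest)
        return manifest.version

    def read(self, version=None):
        # Reading a version only follows that manifest's fragment list.
        manifest = self.manifests[-1] if version is None else self.manifests[version - 1]
        return [row for name in manifest.fragments for row in self.store[name]]
```

Appending twice yields two manifests that share the first fragment, which is why older snapshots stay readable without duplicating data.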

Quick Start & Requirements

  • Install: pip install pylance
  • Preview releases: pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
  • Dependencies: Python 3.7+, DuckDB v0.7+ for certain query examples.
  • Documentation: https://lancedb.github.io/lance/

Highlighted Details

  • Claims 100x faster random access than Parquet.
  • Supports vector search with GPU (CUDA, MPS) and CPU acceleration.
  • Offers zero-copy, automatic data versioning.
  • Integrates with Pandas, DuckDB, Polars, PyArrow, and Ray.
  • Benchmarks show sub-millisecond average response times for vector search on a MacBook Air.

Maintenance & Community

Lance is in active development with contributions welcomed. It is used in production by LanceDB and several large-scale AI and self-driving car companies. Presentations and blog posts are available, including talks at Ray Summit and Scipy. Community channels include Discord and X.

Licensing & Compatibility

The project appears to be Apache 2.0 licensed, which is permissive for commercial use and closed-source linking.

Limitations & Caveats

Fast updates via write-ahead logs are listed as a roadmap item, implying current update performance may not be optimized. Some query examples mention potential segfaults if specific DuckDB versions are not installed, indicating potential integration fragility.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 157
  • Issues (30d): 119

Star History

600 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Samuel Colvin (Author of Pydantic, Pydantic Logfire, PydanticAI), and 4 more.

quokka by marsupialtail

0.1%
1k
Distributed query engine for time series data
created 3 years ago
updated 11 months ago