yams by trvon

Content-addressable storage for LLMs and applications

Created 5 months ago
360 stars

Top 77.9% on SourcePulse

Project Summary

YAMS is a content-addressable storage system for LLMs and applications that combines deduplication with full-text and semantic search. It targets developers and researchers who need persistent, versioned, and easily searchable data storage with data integrity checks and efficient retrieval.

How It Works

YAMS addresses content by its SHA-256 hash, which supports data integrity and immutability: identical content always resolves to the same address. Block-level deduplication relies on Rabin fingerprinting. Search covers both full-text queries through SQLite FTS5 and semantic queries through vector embeddings. A write-ahead log handles crash recovery, and the architecture is thread-safe, with reported throughput exceeding 100 MB/s.
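
To make the content-addressing idea concrete, here is a minimal Python sketch of a store keyed by SHA-256 digest. It is an illustration only and does not reflect the YAMS API or its on-disk format; the `ContentStore` class and its `put`/`get` methods are hypothetical names chosen for this example.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not the YAMS API)."""

    def __init__(self):
        self._blobs = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is kept only once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hashing on read detects content that no longer matches its address.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored content does not match its address")
        return data

    def __len__(self):
        return len(self._blobs)


store = ContentStore()
a = store.put(b"hello world")
b = store.put(b"hello world")   # duplicate content, same address
c = store.put(b"hello world!")  # different content, different address
```

Because the key is a function of the bytes, writing the same content twice costs no extra space, and any change to the content produces a new address rather than overwriting the old one.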

Quick Start & Requirements

  • Installation: Docker (docker run --rm -it ghcr.io/trvon/yams:latest --version) or build from source using Conan (recommended).
  • Prerequisites: C++20 compiler (GCC 11+, Clang 14+), CMake 3.20+, Python 3.8+ (for Conan). macOS: brew install openssl@3 protobuf sqlite3 ncurses. Linux: apt install libssl-dev libsqlite3-dev protobuf-compiler libncurses-dev.
  • Setup: Initialize storage with yams init --non-interactive.
  • Docs: LLM Integration Guide, CLI Usage Examples

Highlighted Details

  • Content-addressed storage with SHA-256 hashing.
  • Block-level deduplication using Rabin fingerprinting.
  • Combined full-text (SQLite FTS5) and semantic search.
  • Write-ahead logging for crash recovery.
  • High performance: reported throughput above 100 MB/s.
  • Optional PDF text extraction support.
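
The block-level deduplication listed above can be sketched with content-defined chunking. The Python example below uses a simple polynomial rolling hash in the Rabin-Karp style as a stand-in; it is not YAMS's Rabin fingerprinting implementation, and the constants (`WINDOW`, `MASK`, `BASE`, `MOD`, `MIN_CHUNK`) and the `chunk` function are assumptions picked for illustration.

```python
import hashlib
import random

WINDOW = 16          # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero (assumed value)
BASE = 257
MOD = (1 << 31) - 1
MIN_CHUNK = 16       # avoid very small chunks (assumed value)


def chunk(data: bytes):
    """Split data at content-defined boundaries using a rolling hash."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # A boundary depends on nearby content, not on the absolute offset.
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"a few new bytes at the front" + original

# Store each chunk under its SHA-256 digest; repeated chunks are kept once.
store = {}
for blob in (original, edited):
    for piece in chunk(blob):
        store.setdefault(hashlib.sha256(piece).hexdigest(), piece)

total = len(original) + len(edited)
stored = sum(len(p) for p in store.values())
print(f"{stored} bytes stored for {total} bytes written")
```

Fixed-size blocks would shift after the inserted prefix and stop matching. Content-defined boundaries realign shortly after the edit, so most chunks of the two versions hash to the same digests and are stored once.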

Maintenance & Community

The project is actively maintained by trvon. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

Licensed under Apache-2.0, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

Traditional CMake builds (without Conan) are noted to have dependency resolution issues; Conan builds are recommended. PDF extraction may fail if PDFium download is blocked by firewalls. Retrieval by name is listed as "coming soon."

Health Check

  • Last Commit: 20 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 3 stars in the last 30 days
