ModelCache by codefuse-ai

LLM semantic cache for reducing response time via cached query-result pairs

created 1 year ago
967 stars

Top 38.9% on sourcepulse

Project Summary

A semantic caching system for large language models (LLMs) that reduces response times and inference costs by caching query-result pairs. It is designed for businesses and research institutions looking to optimize LLM service performance and scalability.

How It Works

ModelCache employs a modular architecture built from adapter, embedding, similarity-evaluation, and data-management components. The embedding module converts text into vector representations for similarity matching, and the adapter module orchestrates the business logic that ties these components together. Data is managed via paired scalar and vector storage; recent updates add Redis Search for faster embedding retrieval and integration with embedding frameworks such as llmEmb, ONNX, and timm.
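
The end-to-end flow is: embed the incoming query, search the vector store for a semantically similar past query, and either return the cached answer (a hit) or call the LLM and write the new pair back (a miss). The sketch below illustrates that flow only; the class, the stand-in embedder, and the 0.9 threshold are illustrative assumptions, not ModelCache's actual API.

    import numpy as np

    def embed(text, dim=8):
        # Stand-in embedding: a real deployment would use an ONNX or
        # Hugging Face model loaded by the embedding module.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    class SemanticCache:
        def __init__(self, threshold=0.9):
            self.threshold = threshold  # similarity cutoff for a cache hit
            self.vectors = []           # vector storage (Milvus/FAISS in ModelCache)
            self.answers = []           # scalar storage (MySQL/SQLite in ModelCache)

        def lookup(self, query):
            q = embed(query)
            for vec, ans in zip(self.vectors, self.answers):
                if float(q @ vec) >= self.threshold:  # cosine sim of unit vectors
                    return ans                        # hit: skip the LLM call
            return None                               # miss

        def store(self, query, answer):
            self.vectors.append(embed(query))
            self.answers.append(answer)

    cache = SemanticCache()
    question = "What is semantic caching?"
    answer = cache.lookup(question)
    if answer is None:
        answer = "...LLM response..."  # on a miss, call the real LLM here
        cache.store(question, answer)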

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, MySQL, Milvus (for standard service). Demo uses SQLite and FAISS. Requires downloading embedding model bin files from Hugging Face.
  • Demo: python flask4modelcache_demo.py
  • Standard Service: Configure milvus_config.ini and mysql_config.ini, then run python flask4modelcache.py. A running service is exercised over HTTP (see the client sketch after this list).
  • Docker: docker-compose up (requires docker network create modelcache first).
  • Models: embedding model weights are downloaded from Hugging Face.
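
Once the demo or standard service is running, it can be queried over HTTP. The endpoint path, port, and payload fields below are assumptions for illustration, not a documented API; check the repository's README for the exact request format.

    import requests

    url = "http://127.0.0.1:5000/modelcache"  # assumed Flask route

    # Write a query/answer pair into the cache (assumed 'insert' payload).
    insert = {
        "type": "insert",
        "scope": {"model": "demo-model"},
        "chat_info": [{"query": "What is ModelCache?",
                       "answer": "A semantic cache for LLM services."}],
    }
    requests.post(url, json=insert, timeout=10)

    # Look the question up; a semantically similar phrasing should hit.
    query = {"type": "query",
             "scope": {"model": "demo-model"},
             "query": "Explain what ModelCache is."}
    print(requests.post(url, json=query, timeout=10).json())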

Highlighted Details

  • Integrates with LLM products as a lightweight, Redis-like semantic cache, designed to be compatible with any LLM service.
  • Supports local embedding model loading and multiple embedding layers.
  • Implements data isolation for development/production environments and multi-tenancy.
  • Differentiates between long and short text for similarity assessment.
  • Sets the Milvus consistency level to "Session", trading stronger global consistency for better query performance (see the sketch below).
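
ModelCache applies the "Session" level internally; the snippet below only illustrates, with pymilvus directly, what that setting looks like. The collection name, schema, and index parameters are made up for the example, and a local Milvus instance is assumed.

    from pymilvus import (Collection, CollectionSchema, FieldSchema,
                          DataType, connections)

    connections.connect(host="127.0.0.1", port="19530")

    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
    ]
    schema = CollectionSchema(fields)

    # "Session" guarantees a client reads its own writes without paying
    # for the stronger (slower) global consistency levels.
    cache = Collection("modelcache_demo", schema, consistency_level="Session")

    cache.create_index("embedding", {"index_type": "IVF_FLAT",
                                     "metric_type": "L2",
                                     "params": {"nlist": 128}})
    cache.load()

    results = cache.search(
        data=[[0.0] * 768],             # query embedding placeholder
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=3,
        consistency_level="Session",    # can also be set per request
    )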

Maintenance & Community

This project acknowledges inspiration from GPTCache. Contributions are welcomed via issues, suggestions, code, or documentation.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is under active development; its roadmap ("Todo List") includes FastAPI support, a visual interface, further inference optimization, and additional storage backends such as MongoDB and Elasticsearch. Compatibility with specific inference engines such as FasterTransformer is planned.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 31 stars in the last 90 days

