LLM semantic cache for reducing response time via cached query-result pairs
A semantic caching system for large language models (LLMs) that reduces response times and inference costs by caching query-result pairs. It is designed for businesses and research institutions looking to optimize LLM service performance and scalability.
How It Works
ModelCache uses a modular architecture comprising adapter, embedding, similarity evaluation, and data management modules. The embedding module converts text into vector representations for similarity matching, while the adapter module orchestrates the business logic that ties these components together. Data is managed through separate scalar and vector storage; recent updates add Redis Search for faster embedding retrieval and integration with embedding frameworks such as 'llmEmb', 'ONNX', and 'timm'.
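As a rough illustration of this flow (not ModelCache's actual API), the sketch below embeds a query, checks cached embeddings for a sufficiently similar match, and only calls the LLM on a miss. The embed and call_llm functions and the in-memory store are hypothetical stand-ins for the embedding module, the model backend, and the scalar/vector storage.

```python
# Minimal sketch of semantic query-result caching; names are illustrative only.
import numpy as np

_store = []  # (embedding, response) pairs standing in for vector + scalar storage

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real deployment would use an ONNX/Hugging Face model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def call_llm(query: str) -> str:
    # Stand-in for an actual LLM inference call.
    return f"LLM answer for: {query}"

def cached_query(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    # Similarity evaluation: cosine similarity against cached embeddings.
    for vec, response in _store:
        if float(np.dot(q, vec)) >= threshold:
            return response           # cache hit: skip the LLM call
    response = call_llm(query)        # cache miss: query the LLM
    _store.append((q, response))      # write the new query-result pair back
    return response

print(cached_query("What is semantic caching?"))
print(cached_query("What is semantic caching?"))  # served from cache, no LLM call
```

The second call returns the cached response without invoking the model, which is where the latency and cost savings come from.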
Quick Start & Requirements
- Demo: pip install -r requirements.txt, then python flask4modelcache_demo.py
- Standard service: configure milvus_config.ini and mysql_config.ini, then run python flask4modelcache.py (a sample request against the running service is sketched below)
- Docker: docker-compose up (requires docker network create modelcache first)
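Once the service is up, it is typically called over HTTP. The snippet below is a hedged sketch only: the URL, route, and payload field names (type, scope, query) are assumptions rather than the documented schema, so consult the ModelCache README for the actual request format.

```python
# Hypothetical request to a locally running flask4modelcache service.
import json
import requests

url = "http://127.0.0.1:5000/modelcache"   # assumed default host, port, and route
payload = {
    "type": "query",                        # assumed operation name
    "scope": {"model": "my-model"},         # assumed model/scope field
    "query": [{"role": "user", "content": "Hello, world"}],
}
res = requests.post(url,
                    headers={"Content-Type": "application/json"},
                    data=json.dumps(payload))
print(res.status_code, res.text)
```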
Highlighted Details
Maintenance & Community
This project acknowledges inspiration from GPTCache. Contributions are welcome in the form of issues, suggestions, code, and documentation.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The project is under active development; its "Todo List" includes support for FastAPI, a visual interface, further inference optimization, and additional storage backends such as MongoDB and Elasticsearch. Compatibility with specific inference engines such as FasterTransformer is planned.