LLM semantic cache for reducing response time via cached query-result pairs
A semantic caching system for large language models (LLMs) that reduces response times and inference costs by caching query-result pairs. It is designed for businesses and research institutions looking to optimize LLM service performance and scalability.
How It Works
ModelCache uses a modular architecture comprising adapter, embedding, similarity evaluation, and data management modules. The embedding module converts text into vector representations for similarity matching, while the adapter module orchestrates the business logic that ties these components together. Data is managed through separate scalar and vector storage; recent updates add Redis Search for faster embedding retrieval and integration with embedding frameworks such as 'llmEmb', 'ONNX', and 'timm'.
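As a rough illustration of this flow (not ModelCache's actual API), the sketch below embeds a query, checks cached embeddings for a sufficiently similar match, and only calls the LLM on a miss. The embed and call_llm functions and the in-memory store are hypothetical stand-ins for the embedding module, the model backend, and the scalar/vector storage.

```python
# Minimal sketch of semantic query-result caching; names are illustrative only.
import numpy as np

_store = []  # (embedding, response) pairs standing in for vector + scalar storage

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real deployment would use an ONNX/Hugging Face model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def call_llm(query: str) -> str:
    # Stand-in for an actual LLM inference call.
    return f"LLM answer for: {query}"

def cached_query(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    # Similarity evaluation: cosine similarity against cached embeddings.
    for vec, response in _store:
        if float(np.dot(q, vec)) >= threshold:
            return response           # cache hit: skip the LLM call
    response = call_llm(query)        # cache miss: query the LLM
    _store.append((q, response))      # write the new query-result pair back
    return response

print(cached_query("What is semantic caching?"))
print(cached_query("What is semantic caching?"))  # served from cache, no LLM call
```

The second call returns the cached response without invoking the model, which is where the latency and cost savings come from.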
Quick Start & Requirements
- Demo: pip install -r requirements.txt, then python flask4modelcache_demo.py
- Standard service: configure milvus_config.ini and mysql_config.ini, then run python flask4modelcache.py (a sample request against the running service is sketched below)
- Docker: docker-compose up (requires docker network create modelcache first)
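Once the service is up, it is typically called over HTTP. The snippet below is a hedged sketch only: the URL, route, and payload field names (type, scope, query) are assumptions rather than the documented schema, so consult the ModelCache README for the actual request format.

```python
# Hypothetical request to a locally running flask4modelcache service.
import json
import requests

url = "http://127.0.0.1:5000/modelcache"   # assumed default host, port, and route
payload = {
    "type": "query",                        # assumed operation name
    "scope": {"model": "my-model"},         # assumed model/scope field
    "query": [{"role": "user", "content": "Hello, world"}],
}
res = requests.post(url,
                    headers={"Content-Type": "application/json"},
                    data=json.dumps(payload))
print(res.status_code, res.text)
```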
Highlighted Details
Maintenance & Community
This project acknowledges inspiration from GPTCache. Contributions are welcome in the form of issues, suggestions, code, and documentation.
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The project is under active development; its "Todo List" includes support for FastAPI, a visual interface, further inference optimization, and additional storage backends such as MongoDB and Elasticsearch. Compatibility with specific inference engines such as FasterTransformer is planned.