CAG: RAG alternative using LLM context windows, research paper
Cache-Augmented Generation (CAG) offers a retrieval-free alternative to Retrieval-Augmented Generation (RAG) for enhancing LLM responses with external knowledge. It targets users seeking reduced latency, improved reliability, and a simpler system design than traditional RAG, achieved by preloading relevant data into the LLM's context window and caching the resulting KV cache.
How It Works
CAG bypasses real-time retrieval by preloading all necessary external knowledge into the LLM's extended context window and precomputing the corresponding KV cache. During inference, the model reuses this cached state and generates answers directly, without the overhead of a separate retrieval step. The approach aims to match or exceed RAG's results with a simpler architecture and lower latency.
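To make the mechanism concrete, here is a minimal sketch using Hugging Face transformers: the knowledge is encoded once into a KV cache, and each question is answered by feeding only its tokens on top of that cache. The prompt format, helper names, and greedy decoding loop are illustrative assumptions, not the repository's kvcache.py.

```python
# Minimal CAG sketch (illustrative, not the repo's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 1) Preload: run the knowledge documents through the model once and keep the KV cache.
knowledge = "..."  # placeholder for the concatenated source documents
preamble = f"Answer questions using only the following documents.\n{knowledge}\n"
knowledge_ids = tokenizer(preamble, return_tensors="pt").input_ids.to(model.device)

kv_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=knowledge_ids, past_key_values=kv_cache, use_cache=True)
knowledge_len = knowledge_ids.shape[-1]

# 2) Inference: feed only the question tokens; positions continue from the cached prefix.
def answer(question: str, max_new_tokens: int = 128) -> str:
    q_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids.to(model.device)
    generated = []
    next_input = q_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids=next_input, past_key_values=kv_cache, use_cache=True)
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
            if next_token.item() == tokenizer.eos_token_id:
                break
            generated.append(next_token)
            next_input = next_token
    # Trim the cache back to the preloaded prefix so it can be reused for the next question.
    kv_cache.crop(knowledge_len)
    if not generated:
        return ""
    return tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)

print(answer("What is Cache-Augmented Generation?"))
```

The key design point is that the expensive prefill over the knowledge happens once; every subsequent query only pays for its own question and answer tokens.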
Quick Start & Requirements
pip install -r ./requirements.txt
sh ./downloads.sh
Install dependencies and fetch the data with the commands above, then create a .env file with the required API keys. The experiments use the meta-llama/Llama-3.1-8B-Instruct model, the squad and hotpotqa datasets, and bertscore for similarity scoring; a GPU is recommended for performance. For Docker, build with docker build -t my-cag-app . and run with docker run --gpus all -it --rm my-cag-app (or the CPU variant). Run python kvcache.py for CAG and python rag.py for RAG; see the README for detailed parameter examples.
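As a small illustration of the evaluation pieces mentioned above, the sketch below loads a slice of squad with the datasets library and scores a candidate answer with bert-score. The candidate string is a stand-in for a model-generated answer, and the repo's actual evaluation code may differ.

```python
# Illustrative evaluation snippet: dataset loading plus BERTScore similarity.
from datasets import load_dataset
from bert_score import score

squad = load_dataset("squad", split="validation[:5]")  # small slice for a quick check
sample = squad[0]
reference = sample["answers"]["text"][0]            # gold answer from the dataset
candidate = "a placeholder model-generated answer"   # would come from kvcache.py / rag.py output

P, R, F1 = score([candidate], [reference], lang="en")
print(f"Question: {sample['question']}")
print(f"BERTScore F1: {F1.item():.3f}")
```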
Highlighted Details
Uses the Llama-3.1-8B-Instruct model with bertscore for similarity evaluation, and provides separate scripts for the CAG (kvcache.py) and RAG (rag.py) experiments.
Maintenance & Community
The project is associated with research presented at ACM Web Conference 2025. Acknowledgments mention support from Taiwan's National Science and Technology Council (NSTC) and Academia Sinica.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
CAG is limited by the LLM's context window size, making it less suitable for extremely large datasets. Performance may degrade with very long contexts, though ongoing LLM advancements are expected to mitigate this.
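As a rough way to gauge this constraint, the snippet below (an illustrative check, not part of the repo) compares a corpus's token count with the model's context limit; corpus.txt is a placeholder path.

```python
# Illustrative check: does the knowledge corpus fit in the model's context window?
from transformers import AutoConfig, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

with open("corpus.txt") as f:  # placeholder path to the concatenated knowledge documents
    knowledge = f.read()

n_tokens = len(tokenizer(knowledge)["input_ids"])
limit = config.max_position_embeddings  # 131072 for Llama-3.1 models
print(f"{n_tokens} knowledge tokens vs. a {limit}-token context window")
if n_tokens >= limit:
    print("Corpus too large to preload in full; filtering or falling back to RAG may be needed.")
```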