Prompt compression for accelerated LLM inference
This repository provides LLMLingua, LongLLMLingua, and LLMLingua-2, a suite of tools for prompt compression to accelerate LLM inference and improve long-context understanding. Targeting developers and researchers working with LLMs, these tools offer significant cost savings and performance enhancements by reducing token usage with minimal impact on output quality.
How It Works
LLMLingua employs a compact, pre-trained language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression. LongLLMLingua specifically addresses the "lost in the middle" issue in long contexts by reordering and compressing information, improving RAG performance. LLMLingua-2 utilizes data distillation from larger models to create a task-agnostic compressor, offering faster performance and better out-of-domain handling.
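To make the coarse-to-fine pruning idea concrete, here is a toy sketch in the spirit of LLMLingua's token-level compression. It is not the real algorithm: LLMLingua ranks tokens by a small causal LM's perplexity, whereas this self-contained example approximates "informativeness" with inverse word frequency, dropping the most common (least informative) tokens first until a target ratio is met.

```python
# Toy sketch of perplexity-style prompt compression (NOT the real
# LLMLingua algorithm). LLMLingua scores tokens with a small LM;
# here "informativeness" is approximated by inverse word frequency
# so the example runs without any model.
from collections import Counter

def compress_prompt_sketch(prompt: str, keep_ratio: float = 0.5) -> str:
    tokens = prompt.split()
    freq = Counter(t.lower() for t in tokens)
    # Rare tokens carry more information; common tokens (e.g. "the")
    # are pruned first, mirroring how low-information tokens go first.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: freq[tokens[i].lower()])  # rarest first
    kept = sorted(ranked[:n_keep])  # restore original token order
    return " ".join(tokens[i] for i in kept)

prompt = "the quick brown fox jumps over the lazy dog near the river bank"
print(compress_prompt_sketch(prompt, keep_ratio=0.5))
# → quick brown fox jumps over lazy
```

The real compressors additionally work at sentence and demonstration granularity, and LongLLMLingua reorders documents by relevance before pruning, which is what mitigates the "lost in the middle" effect.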
Quick Start & Requirements
pip install llmlingua
The compressors run on small language models such as microsoft/phi-2, and quantized models like TheBloke/Llama-2-7b-Chat-GPTQ are also supported (requiring <8GB of GPU memory).

Highlighted Details
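After installation, basic usage follows the PromptCompressor API from the project's documentation. The snippet below is a sketch: the model name, contexts, and token budget are placeholders to adapt, and the first call downloads model weights, so it needs network access and adequate GPU memory.

```python
from llmlingua import PromptCompressor

# Use microsoft/phi-2 as the small scoring model; the first call
# downloads the weights (several GB).
llm_lingua = PromptCompressor("microsoft/phi-2")

compressed = llm_lingua.compress_prompt(
    context=["<your long retrieved documents here>"],  # placeholder
    instruction="Answer the question based on the context.",
    question="<your question here>",  # placeholder
    target_token=200,  # compress the context to roughly 200 tokens
)

# The result is a dict with the compressed prompt and token statistics.
print(compressed["compressed_prompt"])
print(compressed["origin_tokens"], "->", compressed["compressed_tokens"])
```

The compressed prompt can then be passed to any downstream LLM, which is where the cost and latency savings come from.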
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats