LLMLingua by microsoft

Prompt compression for accelerated LLM inference

Created 2 years ago
5,436 stars

Top 9.3% on SourcePulse

View on GitHub
Project Summary

This repository provides LLMLingua, LongLLMLingua, and LLMLingua-2, a suite of tools for prompt compression to accelerate LLM inference and improve long-context understanding. Targeting developers and researchers working with LLMs, these tools offer significant cost savings and performance enhancements by reducing token usage with minimal impact on output quality.

How It Works

LLMLingua employs a compact, pre-trained language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression. LongLLMLingua addresses the "lost in the middle" issue in long contexts by reordering and compressing information, improving RAG performance. LLMLingua-2 uses data distillation from GPT-4 to train a task-agnostic token-classification compressor, offering faster compression and better out-of-domain generalization.
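Concretely, the library exposes a single PromptCompressor class whose compress_prompt method takes the context plus an optional instruction and question. The following is a minimal sketch based on the usage pattern in the project's README; the sample context, the target_token value, and the exact result keys are illustrative and may differ across versions:

    from llmlingua import PromptCompressor

    # Illustrative long context; in practice this would be retrieved
    # documents or chat history.
    context = [
        "Paris is the capital and most populous city of France. " * 40,
        "The Eiffel Tower, completed in 1889, is a landmark of Paris. " * 40,
    ]

    # Loads the default small compressor model (weights download on first
    # use; a GPU is recommended).
    llm_lingua = PromptCompressor()

    # Compress toward a fixed token budget; the instruction and question
    # guide which tokens are kept as query-relevant.
    result = llm_lingua.compress_prompt(
        context,
        instruction="Answer the question based on the given context.",
        question="When was the Eiffel Tower completed?",
        target_token=200,
    )

    print(result["compressed_prompt"])  # send this to the downstream LLM
    print(result["ratio"])              # reported compression ratio, e.g. "11.2x"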

Quick Start & Requirements

  • Install via pip: pip install llmlingua
  • Usage examples for LLMLingua, LongLLMLingua, and LLMLingua-2 are provided in the README.
  • Supports various compressor models, including microsoft/phi-2 and quantized models like TheBloke/Llama-2-7b-Chat-GPTQ (requiring <8GB GPU memory); see the configuration sketch after this list.
  • Official documentation and demos are available.
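For constrained hardware, the compressor model can be swapped at construction time. A minimal sketch following the README's pattern; the model_config keys and the optional GPTQ dependencies are assumptions that may vary by llmlingua version:

    from llmlingua import PromptCompressor

    # Use a small model as the compressor instead of the default 7B model.
    llm_lingua_phi2 = PromptCompressor("microsoft/phi-2")

    # Or a 4-bit GPTQ-quantized 7B model, which fits in under 8GB of GPU
    # memory (assumes the optional GPTQ dependencies, e.g. auto-gptq,
    # are installed).
    llm_lingua_gptq = PromptCompressor(
        "TheBloke/Llama-2-7b-Chat-GPTQ",
        model_config={"revision": "main"},
    )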

Highlighted Details

  • Achieves up to 20x prompt compression with minimal performance loss.
  • Enhances RAG performance by up to 21.4% with LongLLMLingua.
  • LLMLingua-2 offers a 3x-6x speed improvement over the original LLMLingua; see the sketch following this list.
  • Integrations available for Prompt flow, LangChain, and LlamaIndex.
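LLMLingua-2 is selected by pointing PromptCompressor at an LLMLingua-2 checkpoint. A minimal sketch based on the README: the checkpoint name and use_llmlingua2 flag follow the project's documented usage, while the sample prompt, rate, and force_tokens values are illustrative:

    from llmlingua import PromptCompressor

    # LLMLingua-2: a small encoder trained by data distillation from GPT-4,
    # enabled via the use_llmlingua2 flag.
    llm_lingua = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
    )

    prompt = (
        "Meeting notes: The Q3 budget review covered headcount, cloud "
        "spend, and the migration timeline. Action items were assigned "
        "to each team.\nQuestion: What topics did the review cover?"
    )

    # rate=0.33 keeps roughly a third of the tokens; force_tokens preserves
    # structurally important characters such as newlines and question marks.
    result = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=["\n", "?"])
    print(result["compressed_prompt"])

Because LLMLingua-2 is task-agnostic, no question or instruction is needed, which is what makes it suitable as a drop-in preprocessing step.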

Maintenance & Community

  • Developed by Microsoft.
  • Active development, with recent news highlighting related Microsoft projects (SCBench, RetrievalAttention, MInference).
  • Contributions are welcomed via a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The provided README text does not explicitly state a license. Check the repository's LICENSE file before commercial use or closed-source linking.

Limitations & Caveats

  • The README does not explicitly state the license, which could be a blocker for commercial adoption.
  • While compression aims for minimal performance loss, output quality should be validated for specific use cases.
Health Check

  • Last Commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 95 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 3 more.

prompt-lookup-decoding by apoorvumang

0.2%
566 stars
Decoding method for faster LLM generation
Created 1 year ago
Updated 1 year ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago