Prompt compression for accelerated LLM inference
This repository provides LLMLingua, LongLLMLingua, and LLMLingua-2, a suite of tools for prompt compression to accelerate LLM inference and improve long-context understanding. Targeting developers and researchers working with LLMs, these tools offer significant cost savings and performance enhancements by reducing token usage with minimal impact on output quality.
How It Works
LLMLingua employs a compact, pre-trained language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression. LongLLMLingua specifically addresses the "lost in the middle" issue in long contexts by reordering and compressing information, improving RAG performance. LLMLingua-2 utilizes data distillation from larger models to create a task-agnostic compressor, offering faster performance and better out-of-domain handling.
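To make the coarse-to-fine pruning idea concrete, here is a toy sketch in the spirit of LLMLingua's token-level compression. It is not the real algorithm: LLMLingua ranks tokens by a small causal LM's perplexity, whereas this self-contained example approximates "informativeness" with inverse word frequency, dropping the most common (least informative) tokens first until a target ratio is met.

```python
# Toy sketch of perplexity-style prompt compression (NOT the real
# LLMLingua algorithm). LLMLingua scores tokens with a small LM;
# here "informativeness" is approximated by inverse word frequency
# so the example runs without any model.
from collections import Counter

def compress_prompt_sketch(prompt: str, keep_ratio: float = 0.5) -> str:
    tokens = prompt.split()
    freq = Counter(t.lower() for t in tokens)
    # Rare tokens carry more information; common tokens (e.g. "the")
    # are pruned first, mirroring how low-information tokens go first.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: freq[tokens[i].lower()])  # rarest first
    kept = sorted(ranked[:n_keep])  # restore original token order
    return " ".join(tokens[i] for i in kept)

prompt = "the quick brown fox jumps over the lazy dog near the river bank"
print(compress_prompt_sketch(prompt, keep_ratio=0.5))
# → quick brown fox jumps over lazy
```

The real compressors additionally work at sentence and demonstration granularity, and LongLLMLingua reorders documents by relevance before pruning, which is what mitigates the "lost in the middle" effect.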
Quick Start & Requirements
pip install llmlingua
The compressors run on small language models such as microsoft/phi-2, and quantized models like TheBloke/Llama-2-7b-Chat-GPTQ are also supported (requiring <8GB of GPU memory).

Highlighted Details
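After installation, basic usage follows the PromptCompressor API from the project's documentation. The snippet below is a sketch: the model name, contexts, and token budget are placeholders to adapt, and the first call downloads model weights, so it needs network access and adequate GPU memory.

```python
from llmlingua import PromptCompressor

# Use microsoft/phi-2 as the small scoring model; the first call
# downloads the weights (several GB).
llm_lingua = PromptCompressor("microsoft/phi-2")

compressed = llm_lingua.compress_prompt(
    context=["<your long retrieved documents here>"],  # placeholder
    instruction="Answer the question based on the context.",
    question="<your question here>",  # placeholder
    target_token=200,  # compress the context to roughly 200 tokens
)

# The result is a dict with the compressed prompt and token statistics.
print(compressed["compressed_prompt"])
print(compressed["origin_tokens"], "->", compressed["compressed_tokens"])
```

The compressed prompt can then be passed to any downstream LLM, which is where the cost and latency savings come from.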
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats