SepLLM accelerates LLMs by compressing segments into separators
SepLLM offers a method to accelerate Large Language Models (LLMs) by compressing segments of text into separator tokens, reducing computational cost and speeding up inference. It targets researchers and practitioners seeking to improve LLM efficiency, with a plug-and-play framework and efficient kernels for training acceleration.
How It Works
SepLLM leverages the observation that certain separator tokens (like punctuation) disproportionately contribute to attention scores. It compresses information from segments between these separators into the separators themselves, effectively eliminating redundant tokens. This approach aims to maintain performance while significantly reducing the KV cache size and speeding up inference.
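For intuition, here is a minimal, self-contained sketch of which cache positions such a scheme retains (a few initial tokens, the separator tokens, and a recent local window); the token ids and budgets below are illustrative, not the repository's defaults.

```python
# Toy illustration of separator-based KV-cache retention: keep (a) a few initial
# tokens, (b) separator tokens, and (c) a recent local window, dropping the rest.
SEPARATOR_IDS = {11, 13, 30}   # assumed ids for tokens like ",", ".", "?"
INIT_KEEP = 4                  # always keep the first few tokens
LOCAL_WINDOW = 8               # always keep the most recent tokens

def kept_positions(token_ids):
    n = len(token_ids)
    keep = set(range(min(INIT_KEEP, n)))                                 # initial tokens
    keep |= {i for i, t in enumerate(token_ids) if t in SEPARATOR_IDS}   # separator tokens
    keep |= set(range(max(0, n - LOCAL_WINDOW), n))                      # recent window
    return sorted(keep)

# Two "segments" separated by punctuation-like tokens (13 and 11).
tokens = list(range(100, 140)) + [13] + list(range(140, 180)) + [11] + list(range(180, 200))
print(f"kept {len(kept_positions(tokens))} of {len(tokens)} KV entries")
```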
Quick Start & Requirements
Requires the provided transformers wheel package (transformers-4.38.0.post1+sepllm-py3-none-any.whl) and potentially other dependencies like flash-attn and lm_eval.
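A minimal quick-start sketch, assuming the wheel above has been installed in place of stock transformers; the checkpoint name is illustrative, and no SepLLM-specific flags are shown since the framework is described as plug-and-play.

```python
# Install the patched wheel (and optional extras) first, e.g.:
#   pip install transformers-4.38.0.post1+sepllm-py3-none-any.whl
#   pip install flash-attn lm_eval
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # illustrative checkpoint, not a SepLLM release
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("SepLLM compresses segment information into separators.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```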
Highlighted Details
SepCache class, now available on HuggingFace's transformers repository (requires transformers>=4.53.0,<4.54.0).
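A hedged sketch of plugging SepCache into generation with a recent transformers release; the constructor arguments shown (budgets for initial, separator, local, and total cache entries) mirror the design described above but are assumptions, so the SepCache documentation should be consulted for the actual signature.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SepCache  # transformers>=4.53.0,<4.54.0

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Assumed budgets for initial tokens, separator tokens, the local window, and
# the total cache -- verify the real parameter names in the SepCache docstring.
past_key_values = SepCache(init_cache_size=4, sep_cache_size=64, local_size=256, cache_size=512)

inputs = tokenizer("A long prompt ...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, past_key_values=past_key_values)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```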
Maintenance & Community
The maintainers have integrated SepCache into HuggingFace's transformers.
Licensing & Compatibility
A LICENSE file is included, but the specific license type is not detailed in the README.
The provided transformers wheel is based on transformers-4.38.0, and compatibility with newer transformers versions (e.g., Llama 3.1) may require manual code adaptation.
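Because the two integration paths target different transformers releases, a small runtime check can help pick the right one; this is purely illustrative and assumes the patched wheel reports a version string containing "sepllm".

```python
from importlib.metadata import version

tf_version = version("transformers")
if tf_version.startswith("4.53."):
    print("Use the SepCache class shipped with transformers.")
elif "sepllm" in tf_version:
    print("Use the patched SepLLM wheel (based on transformers-4.38.0).")
else:
    print(f"transformers {tf_version}: manual adaptation may be required.")
```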
Limitations & Caveats
The Streaming-SepLLM branch requires positional encoding shifting, which is not applicable to general training-free tasks.
Using flash_attention_2 with SepCache is demonstrated but not mandatory for SepCache usage.
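The optional pairing can be enabled at load time through the standard attn_implementation argument; the checkpoint below is illustrative, and flash-attn plus a supported GPU are required.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",       # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional; drop this line to use the default attention backend
    device_map="auto",
)
```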