SepLLM by HKUDS

SepLLM accelerates LLMs by compressing segments into separators

Created 9 months ago
548 stars

Top 58.3% on SourcePulse

Project Summary

SepLLM accelerates Large Language Models (LLMs) by compressing the text segments between separator tokens into the separators themselves, reducing computational cost and speeding up inference. It targets researchers and practitioners seeking to improve LLM efficiency, providing a plug-and-play framework and efficient kernels for training acceleration.

How It Works

SepLLM builds on the observation that certain separator tokens (such as punctuation) receive disproportionately high attention scores. It compresses the information of each segment between separators into the separator token itself, so the remaining tokens in the segment can be dropped. This approach aims to maintain performance while significantly reducing KV cache size and speeding up inference.
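To make the idea concrete, here is a toy sketch (not the project's actual implementation) of a separator-aware keep mask over the KV cache: only initial tokens, separator tokens, and a recent local window are retained. The function name, separator ids, and window sizes are illustrative assumptions.

```python
# Toy illustration of separator-based KV retention; names and sizes are assumptions.
import torch

def separator_keep_mask(input_ids: torch.Tensor,
                        separator_ids: set[int],
                        num_initial: int = 4,
                        local_window: int = 64) -> torch.Tensor:
    """Return a boolean mask over the sequence: True = keep this position's KV entry."""
    seq_len = input_ids.shape[-1]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:num_initial] = True        # initial (attention-sink) tokens
    keep[-local_window:] = True      # recent local window
    for pos, tok in enumerate(input_ids.tolist()):
        if tok in separator_ids:     # separators carry the compressed segment information
            keep[pos] = True
    return keep

# Example: a 1,000-token prompt where hypothetical ids 13 and 11 are separators.
ids = torch.randint(0, 32000, (1000,))
mask = separator_keep_mask(ids, separator_ids={13, 11})
print(f"kept {int(mask.sum())} of {len(ids)} KV entries")
```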

Quick Start & Requirements

  • Installation: Requires installing a custom transformers wheel package (transformers-4.38.0.post1+sepllm-py3-none-any.whl) and potentially other dependencies like flash-attn and lm_eval.
  • Environment: The recommended environment is Python 3.10, PyTorch 2.5.1 with CUDA 12.1, and DeepSpeed; specific versions are crucial for certain features (a version sanity-check sketch follows this list).
  • Setup: Involves creating a conda environment, installing packages, and potentially setting up symbolic links for code modification.
  • Resources: Training requires significant computational resources, including multi-node clusters and shared file systems.
  • Documentation: Detailed usage instructions for streaming, training-free, and training modes are provided.
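As a quick sanity check before running the examples, the snippet below prints the installed versions against those recommended above (custom transformers-4.38.0.post1+sepllm wheel, PyTorch 2.5.1, CUDA 12.1); the expected values in the comments are taken from this summary.

```python
# Environment sanity check; expected versions come from the requirements above.
import torch
import transformers

print("transformers:", transformers.__version__)   # expect 4.38.0.post1+sepllm for the custom wheel
print("torch:", torch.__version__)                  # expect 2.5.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)        # expect 12.1
```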

Highlighted Details

  • Achieves over 50% KV cache reduction on GSM8K-CoT with Llama-3-8B while maintaining performance.
  • Supports processing sequences up to 4 million tokens in streaming settings.
  • Offers a plug-and-play SepCache class, now available in HuggingFace's transformers library (requires transformers>=4.53.0,<4.54.0); see the hedged usage sketch after this list.
  • Includes support for various acceleration techniques like BiPE and Self-Adjust Softmax.
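The sketch below shows how SepCache would plug into standard transformers generation, assuming it follows the usual Cache interface (passed via past_key_values to generate). The zero-argument constructor is an assumption; the real signature likely requires separator token ids and cache-size limits, so consult the SepLLM documentation for the exact parameters.

```python
# Hedged sketch: plugging SepCache into generation via the standard Cache interface.
# Per the summary above, SepCache ships with transformers>=4.53.0,<4.54.0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SepCache

model_name = "meta-llama/Meta-Llama-3-8B"  # example model from the benchmarks above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Question: If a train travels 60 km in 45 minutes, what is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

past_key_values = SepCache()  # illustrative; constructor arguments not verified here
output = model.generate(**inputs, max_new_tokens=128, past_key_values=past_key_values)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```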

Maintenance & Community

  • The accompanying paper was accepted to ICML 2025.
  • Recent updates (July 2025) focus on the integration of SepCache into HuggingFace's transformers.
  • Codebase includes components from GPT-NeoX and Pythia projects.

Licensing & Compatibility

  • The repository contains a LICENSE file, but the specific license type is not detailed in the README.
  • The custom transformers wheel is based on transformers-4.38.0, so support for newer models (e.g., Llama 3.1) that depend on later transformers releases may require manual code adaptation.

Limitations & Caveats

  • The Streaming-SepLLM branch requires positional encoding shifting, which is not applicable to general training-free tasks.
  • The examples demonstrate SepCache together with flash_attention_2, but flash attention is not required to use SepCache.
  • Training from scratch is recommended for optimal performance but requires substantial computational resources and careful environment setup.
Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

streaming-llm by mit-han-lab

0.1%
7k
Framework for efficient LLM streaming
Created 2 years ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

0.6%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

3.5%
5k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 17 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 36 more.

unsloth by unslothai

0.6%
46k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago
Updated 16 hours ago