HKUDS/SepLLM accelerates LLMs by compressing segments into separators
Top 57.9% on SourcePulse
SepLLM offers a method to accelerate Large Language Models (LLMs) by compressing segments of text into separator tokens, reducing computational demands and speeding up inference. It targets researchers and practitioners seeking to improve LLM efficiency, offering a plug-and-play framework and efficient kernels for training acceleration.
How It Works
SepLLM leverages the observation that certain separator tokens (like punctuation) disproportionately contribute to attention scores. It compresses information from segments between these separators into the separators themselves, effectively eliminating redundant tokens. This approach aims to maintain performance while significantly reducing the KV cache size and speeding up inference.
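To make the retention policy concrete, here is a minimal illustrative sketch (not the repository's code) of which KV-cache positions such a policy keeps: initial attention-sink tokens, separator tokens, and a recent local window. The separator token IDs and window sizes are model-specific assumptions.

```python
import torch

def sepllm_keep_mask(token_ids: torch.Tensor, sep_ids: list,
                     n_init: int = 4, n_local: int = 256) -> torch.Tensor:
    """Illustrative only: boolean mask over positions a SepLLM-style
    policy would retain in the KV cache."""
    seq_len = token_ids.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:n_init] = True                # initial (attention-sink) tokens
    if n_local > 0:
        keep[-n_local:] = True          # window of recent tokens
    for sid in sep_ids:
        keep |= (token_ids == sid)      # separators carry compressed segment info
    return keep
```

Everything outside this mask can be dropped, which is what shrinks the KV cache relative to dense attention.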
Quick Start & Requirements
Requires the provided transformers wheel package (transformers-4.38.0.post1+sepllm-py3-none-any.whl) and, depending on the task, additional dependencies such as flash-attn and lm_eval.
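A minimal getting-started sketch, assuming the wheel has been downloaded from the repository; the checkpoint name is a placeholder (SepLLM's experiments use Pythia-family models, but substitute a supported checkpoint of your choice):

```python
# Install the patched wheel and optional extras first (paths are illustrative):
#   pip install transformers-4.38.0.post1+sepllm-py3-none-any.whl
#   pip install flash-attn lm_eval   # optional, task-dependent

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Hello, world.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```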
Highlighted Details
The SepCache class is now available in HuggingFace's transformers library (requires transformers>=4.53.0,<4.54.0).
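A sketch of using SepCache through the standard transformers cache interface. The constructor parameter names below follow the SepLLM materials and are assumptions here; verify them against the SepCache docstring in your transformers version, and note that the checkpoint name and separator token IDs are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import SepCache  # transformers>=4.53.0,<4.54.0

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Parameter names are assumptions based on the SepLLM materials:
cache = SepCache(
    init_cache_size=4,      # attention-sink tokens kept from the start
    sep_cache_size=64,      # budget for retained separator tokens
    local_size=256,         # window of recent tokens
    cache_size=512,         # total KV-cache budget
    separator_token_ids=[tok.convert_tokens_to_ids("."),
                         tok.convert_tokens_to_ids(",")],  # model-specific
    PADDING_ID=tok.pad_token_id or 0,
)

inputs = tok("A long prompt ...", return_tensors="pt")
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```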
Maintenance & Community
The maintainers have upstreamed SepCache into HuggingFace's transformers.
Licensing & Compatibility
The repository includes a LICENSE file, but the specific license type is not detailed in the README. The provided transformers wheel is based on transformers-4.38.0, and compatibility with newer transformers versions (e.g., for Llama 3.1 support) may require manual code adaptation.
Limitations & Caveats
The Streaming-SepLLM branch requires positional-encoding shifting, which is not applicable to general training-free tasks. flash_attention_2 is demonstrated with SepCache but is not mandatory for SepCache usage.
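Since flash_attention_2 is optional, the only difference is the attention backend selected at load time; a brief sketch (the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint

# With FlashAttention-2 (requires the flash-attn package and a supported GPU):
model = AutoModelForCausalLM.from_pretrained(
    model_id, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16
)

# Without it, the default attention backend also works with SepCache:
model = AutoModelForCausalLM.from_pretrained(model_id)
```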