Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding" research paper
LayerSkip provides an implementation for early exit inference and self-speculative decoding in large language models, targeting researchers and developers seeking to accelerate LLM inference. It enables models to exit at earlier layers during generation, significantly reducing latency and computational cost while maintaining accuracy.
How It Works
LayerSkip introduces a self-speculative approach to decoding in which the model can exit at an intermediate layer. In this "early exit" strategy, the first layers of the model act as the draft stage, quickly proposing several tokens in sequence; the full model then verifies those draft tokens in a single forward pass. Because the draft and verification stages share the same weights and reuse the early layers' computation, drafting is cheap, and the cost of a full-depth pass is amortized over multiple accepted tokens, yielding substantial speedups when the draft predictions are accurate.
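The draft-then-verify loop can be illustrated with a short, self-contained sketch. Everything below (model_logits, the toy vocabulary and layer count, and the random weights) is hypothetical and only stands in for a LayerSkip-trained checkpoint so the loop runs; it is not the repository's actual API.

```python
import torch

# Toy settings; a real LayerSkip checkpoint defines these.
VOCAB, FULL_LAYERS = 100, 32

def model_logits(tokens: torch.Tensor, exit_layer: int) -> torch.Tensor:
    """Hypothetical stand-in for a model that can stop after `exit_layer` blocks.

    A real model would run `exit_layer` transformer layers and apply the shared
    LM head; here we just build deterministic random logits so the loop runs.
    """
    torch.manual_seed(exit_layer)                    # draft vs. full logits differ slightly
    emb = torch.nn.functional.one_hot(tokens, VOCAB).float()
    return emb @ torch.randn(VOCAB, VOCAB)

def self_speculative_generate(prompt, exit_layer=8, num_speculations=4, max_new=32):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft: exit early and greedily propose `num_speculations` tokens.
        #    (A real implementation reuses the KV cache instead of re-running
        #    the whole prefix on every step.)
        draft = []
        for _ in range(num_speculations):
            logits = model_logits(torch.tensor(tokens + draft), exit_layer)
            draft.append(int(logits[-1].argmax()))
        # 2) Verify: a single full-depth forward pass over prompt + draft tokens.
        full = model_logits(torch.tensor(tokens + draft), FULL_LAYERS)
        # 3) Accept draft tokens while they match the full model's greedy choice.
        accepted = 0
        for i, tok in enumerate(draft):
            if int(full[len(tokens) + i - 1].argmax()) == tok:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # On the first mismatch, take the full model's token as the correction.
        if accepted < len(draft):
            tokens.append(int(full[len(tokens) - 1].argmax()))
    return tokens

print(self_speculative_generate([1, 2, 3]))
```

The sketch recomputes the whole prefix at every draft step for clarity; in the actual method the draft and verification stages share weights and the early layers' KV cache, so the verification pass reuses most of the draft computation.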
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt inside a Python 3.10 Conda environment.
Checkpoints are hosted on HuggingFace (e.g., facebook/layerskip-llama2-7B); downloading them requires a HuggingFace login and model access approval.
Run generation with torchrun generate.py --model <model_name> --generation_strategy self_speculative --exit_layer <layer_num> --num_speculations <num_tokens>.
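Putting these steps together, an illustrative end-to-end sequence might look like the following; the environment name, exit layer, and speculation count are example values, not recommendations from the repository.

```bash
# Example quick start; exit layer and speculation count are illustrative.
conda create -n layerskip python=3.10 -y
conda activate layerskip
pip install -r requirements.txt
huggingface-cli login   # required for gated checkpoints such as facebook/layerskip-llama2-7B

torchrun generate.py \
    --model facebook/layerskip-llama2-7B \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6
```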
Maintenance & Community
The project is from Meta AI (facebookresearch). Contributions are welcome; see the repository's contribution guide.
Licensing & Compatibility
Licensed under CC BY-NC, which prohibits commercial use, including derivative works intended for commercial purposes.
Limitations & Caveats
The CC BY-NC license prohibits commercial use. Speedups are primarily observed with models trained using the LayerSkip recipe and when using self-speculative decoding; standard autoregressive decoding shows no speed benefit, and classification tasks do not benefit from this method either.