LayerSkip by facebookresearch

Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding" research paper

created 1 year ago
323 stars

Top 85.3% on sourcepulse

View on GitHub
Project Summary

LayerSkip provides an implementation for early exit inference and self-speculative decoding in large language models, targeting researchers and developers seeking to accelerate LLM inference. It enables models to exit at earlier layers during generation, significantly reducing latency and computational cost while maintaining accuracy.

How It Works

LayerSkip combines early exit with speculative decoding. Checkpoints are trained with layer dropout and an early-exit loss so that predictions from intermediate layers remain accurate. At inference time, a "draft" stage runs only the first few layers of the model, exiting early at a chosen layer to propose several tokens autoregressively; a "verify" stage then runs the remaining layers over all draft tokens in a single forward pass and accepts the longest prefix that matches the full model's output. Because drafting and verification share the same weights and KV cache, no separate draft model is needed, and the cheap drafting stage combined with parallel verification yields substantial speedups when draft tokens are frequently accepted.
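
As a rough illustration of this draft-then-verify loop (not the repository's actual implementation), the sketch below assumes greedy decoding and abstracts the two stages behind caller-supplied functions: draft_step stands in for the early-exit submodel and verify_logits for the full model; both names are hypothetical.

```python
from typing import Callable, List

def self_speculative_generate(
    draft_step: Callable[[List[int]], int],          # hypothetical: next-token prediction from the first `exit_layer` layers
    verify_logits: Callable[[List[int]], List[int]],  # hypothetical: per-position argmax predictions from the full model
    prompt: List[int],
    num_speculations: int,
    max_new_tokens: int,
) -> List[int]:
    """Illustrative sketch of self-speculative decoding with greedy decoding."""
    assert prompt, "expects a non-empty prompt"
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft: cheaply propose tokens one at a time with the early-exit submodel.
        draft = []
        for _ in range(num_speculations):
            draft.append(draft_step(tokens + draft))

        # Verify: one full-model pass over the drafted positions; keep the longest
        # prefix the full model agrees with, plus its own token at the first mismatch.
        preds = verify_logits(tokens + draft)  # preds[i] predicts the token at position i + 1
        base = len(tokens)
        for i, tok in enumerate(draft):
            full_tok = preds[base + i - 1]
            tokens.append(full_tok)
            if full_tok != tok:
                break
        else:
            # All drafts accepted: the verification pass also yields one extra token for free.
            tokens.append(preds[-1])
    return tokens
```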

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a Python 3.10 Conda environment.
  • Models: Access LayerSkip-trained Llama and CodeLlama checkpoints on HuggingFace (e.g., facebook/layerskip-llama2-7B). Requires HuggingFace login and model access approval.
  • Demo: Run inference with torchrun generate.py --model <model_name> --generation_strategy self_speculative --exit_layer <layer_num> --num_speculations <num_tokens>.
  • Resources: Requires PyTorch and HuggingFace Transformers. Specific hardware requirements depend on the chosen LLM.
  • Docs: Hugging Face integration, PyTorch torchtune integration, Hugging Face trl integration.
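
For a quick try outside this repository, the Hugging Face Transformers integration linked under Docs exposes the same idea through the assistant_early_exit argument to generate(). A minimal sketch, assuming a recent transformers release that supports that argument and approved access to the gated checkpoint:

```python
# Minimal sketch of LayerSkip via the Hugging Face Transformers integration.
# Assumes a transformers version that supports the `assistant_early_exit`
# generation argument and that access to the gated checkpoint has been granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# Draft tokens exit at layer 4 and are verified by the full model
# (early-exit self-speculative decoding).
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```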

Highlighted Details

  • Integrated into Hugging Face Transformers and PyTorch torchtune.
  • Supports self-speculative decoding for accelerated inference.
  • Includes scripts for benchmarking, evaluation (via EleutherAI LM Eval Harness), and hyperparameter sweeping.
  • Offers correctness verification for generated tokens when sampling is disabled.

Maintenance & Community

The project is from Meta AI (facebookresearch). Contributions are welcome; the repository includes a dedicated contributing guide.

Licensing & Compatibility

Licensed under CC BY-NC. The license permits adaptation and redistribution but prohibits commercial use, including derivative works intended for commercial purposes.

Limitations & Caveats

The CC BY-NC license prohibits commercial use. Speedups are primarily observed with checkpoints trained using the LayerSkip recipe and only when self-speculative decoding is enabled; standard autoregressive decoding shows no speed benefit. Because the method accelerates multi-token generation rather than a single forward pass, classification-style tasks do not benefit.

Health Check
Last commit

3 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
1
Star History
34 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Explore Similar Projects

Consistency_LLM by hao-ai-lab

0%
397
Parallel decoder for efficient LLM inference
created 1 year ago
updated 8 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
685
HF Transformers accelerator for faster inference
created 1 year ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

0.1%
1k
Parallel decoding algorithm for faster LLM inference
created 1 year ago
updated 4 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding

0.2%
3k
Framework for accelerating LLM generation using multiple decoding heads
created 1 year ago
updated 1 year ago