LayerSkip by facebookresearch

Code for the research paper "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding"

Created 1 year ago
335 stars

Top 82.0% on SourcePulse

1 Expert Loves This Project
Project Summary

LayerSkip provides an implementation for early exit inference and self-speculative decoding in large language models, targeting researchers and developers seeking to accelerate LLM inference. It enables models to exit at earlier layers during generation, significantly reducing latency and computational cost while maintaining accuracy.

How It Works

LayerSkip turns a single model into both drafter and verifier. Its training recipe (layer dropout plus an early-exit loss) makes the model's intermediate layers produce usable predictions on their own. At inference, the first E layers act as the draft stage, generating several candidate tokens cheaply; the remaining layers then verify all of those candidates in a single forward pass, reusing the draft stage's activations and KV cache instead of a separate draft model. Accepted tokens therefore skip most of the network, and the speedup grows with how often the early-exit drafts agree with the full model.
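
Below is a minimal greedy sketch of that algorithm in Python, not the repository's implementation: the first `exit_layer` blocks draft tokens through the shared LM head, and one full-depth forward pass verifies them. It assumes a LayerSkip-trained checkpoint such as facebook/layerskip-llama2-7B and, for brevity, still runs the whole stack to read the intermediate hidden state, so it illustrates the accept/reject logic rather than the wall-clock speedup; `exit_layer=8` and `num_speculations=6` are illustrative values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # LayerSkip-trained checkpoint (gated on HuggingFace)
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
model.eval()


def early_exit_logits(input_ids, exit_layer):
    # Draft logits from the hidden state after `exit_layer` decoder blocks,
    # reusing the model's final norm and LM head. A real implementation stops
    # the forward pass here and keeps the KV cache; this sketch runs the full
    # stack for simplicity, so it shows the algorithm, not the speedup.
    hidden = model(input_ids, output_hidden_states=True).hidden_states[exit_layer]
    return model.lm_head(model.model.norm(hidden))


@torch.no_grad()
def self_speculative_generate(prompt, exit_layer=8, num_speculations=6, max_new_tokens=64):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) Draft: extend greedily using the cheap early-exit predictions.
        draft = ids
        for _ in range(num_speculations):
            next_tok = early_exit_logits(draft, exit_layer)[:, -1:].argmax(-1)
            draft = torch.cat([draft, next_tok], dim=-1)
        # 2) Verify: one full-depth forward pass over prompt + drafted tokens.
        full_pred = model(draft).logits[:, ids.shape[1] - 1 :].argmax(-1)
        drafted = draft[:, ids.shape[1]:]
        # 3) Accept the longest prefix where draft and full model agree, then
        #    take one extra token from the full model (correction or bonus).
        n_match = (full_pred[:, :-1] == drafted).long().cumprod(-1).sum().item()
        ids = torch.cat([ids, drafted[:, :n_match], full_pred[:, n_match:n_match + 1]], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)


print(self_speculative_generate("Speculative decoding works by"))
```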

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a Python 3.10 Conda environment.
  • Models: Access LayerSkip-trained Llama and CodeLlama checkpoints on HuggingFace (e.g., facebook/layerskip-llama2-7B). Requires HuggingFace login and model access approval.
  • Demo: Run inference with torchrun generate.py --model <model_name> --generation_strategy self_speculative --exit_layer <layer_num> --num_speculations <num_tokens>.
  • Resources: Requires PyTorch and HuggingFace Transformers. Specific hardware requirements depend on the chosen LLM.
  • Docs: Hugging Face transformers integration, PyTorch torchtune integration, and Hugging Face trl integration; a minimal Transformers usage sketch follows this list.
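
As noted in the Docs item above, the same checkpoints can be run through the Hugging Face Transformers integration. A minimal sketch, assuming a transformers release recent enough to expose the `assistant_early_exit` generation argument and approved access to the gated checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # gated; request access on HuggingFace first
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

inputs = tok("Speculative decoding works by", return_tensors="pt")
# `assistant_early_exit` drafts tokens from the given layer and lets the full
# model verify them, analogous to --exit_layer in the repo's generate.py command.
outputs = model.generate(**inputs, assistant_early_exit=8, do_sample=False, max_new_tokens=64)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

Larger `assistant_early_exit` values draft with more layers, which tends to raise the acceptance rate but reduces the savings per drafted token.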

Highlighted Details

  • Integrated into Hugging Face Transformers and PyTorch torchtune.
  • Supports self-speculative decoding for accelerated inference.
  • Includes scripts for benchmarking, evaluation (via EleutherAI LM Eval Harness), and hyperparameter sweeping.
  • Offers correctness verification that, with sampling disabled (greedy decoding), self-speculative output matches standard autoregressive output token for token (see the sketch below this list).
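
A minimal illustration of that correctness check, reusing the Transformers setup from the Quick Start sketch (gated checkpoint and `assistant_early_exit` support assumed; this is not the repository's own verification script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
inputs = tok("Speculative decoding works by", return_tensors="pt")

# With sampling disabled, early-exit self-speculative decoding should
# reproduce plain greedy decoding token for token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)
speculative = model.generate(**inputs, assistant_early_exit=8, do_sample=False, max_new_tokens=64)
print("outputs identical:", torch.equal(greedy, speculative))
```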

Maintenance & Community

The project is from Meta AI (facebookresearch). Contributions are welcome; the repository provides a dedicated contribution document.

Licensing & Compatibility

Licensed under CC BY-NC: the code and any derivative works may be used for non-commercial purposes only.

Limitations & Caveats

The CC BY-NC license prohibits commercial use. Speedups are primarily observed with models trained using the LayerSkip recipe and only when self-speculative decoding is enabled; standard autoregressive decoding gains nothing. Because speculative decoding accelerates token-by-token generation, single-forward-pass workloads such as classification see no benefit.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0%
827
Pretraining code for depth-recurrent language model research
Created 7 months ago
Updated 1 week ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago