LayerSkip by facebookresearch

Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding" research paper

created 1 year ago
323 stars

Top 85.3% on sourcepulse

View on GitHub
Project Summary

LayerSkip provides an implementation for early exit inference and self-speculative decoding in large language models, targeting researchers and developers seeking to accelerate LLM inference. It enables models to exit at earlier layers during generation, significantly reducing latency and computational cost while maintaining accuracy.

How It Works

LayerSkip combines early exit with speculative decoding. Checkpoints are trained with layer dropout and an early-exit loss so that predictions from intermediate layers remain accurate. At inference time, a "draft" stage runs only the first few layers of the model, exiting early at a chosen layer to propose several tokens autoregressively; a "verify" stage then runs the remaining layers over all draft tokens in a single forward pass and accepts the longest prefix that matches the full model's output. Because drafting and verification share the same weights and KV cache, no separate draft model is needed, and the cheap drafting stage combined with parallel verification yields substantial speedups when draft tokens are frequently accepted.
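
As a rough illustration of this draft-then-verify loop (not the repository's actual implementation), the sketch below assumes greedy decoding and abstracts the two stages behind caller-supplied functions: draft_step stands in for the early-exit submodel and verify_logits for the full model; both names are hypothetical.

```python
from typing import Callable, List

def self_speculative_generate(
    draft_step: Callable[[List[int]], int],          # hypothetical: next-token prediction from the first `exit_layer` layers
    verify_logits: Callable[[List[int]], List[int]],  # hypothetical: per-position argmax predictions from the full model
    prompt: List[int],
    num_speculations: int,
    max_new_tokens: int,
) -> List[int]:
    """Illustrative sketch of self-speculative decoding with greedy decoding."""
    assert prompt, "expects a non-empty prompt"
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft: cheaply propose tokens one at a time with the early-exit submodel.
        draft = []
        for _ in range(num_speculations):
            draft.append(draft_step(tokens + draft))

        # Verify: one full-model pass over the drafted positions; keep the longest
        # prefix the full model agrees with, plus its own token at the first mismatch.
        preds = verify_logits(tokens + draft)  # preds[i] predicts the token at position i + 1
        base = len(tokens)
        for i, tok in enumerate(draft):
            full_tok = preds[base + i - 1]
            tokens.append(full_tok)
            if full_tok != tok:
                break
        else:
            # All drafts accepted: the verification pass also yields one extra token for free.
            tokens.append(preds[-1])
    return tokens
```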

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a Python 3.10 Conda environment.
  • Models: Access LayerSkip-trained Llama and CodeLlama checkpoints on HuggingFace (e.g., facebook/layerskip-llama2-7B). Requires HuggingFace login and model access approval.
  • Demo: Run inference with torchrun generate.py --model <model_name> --generation_strategy self_speculative --exit_layer <layer_num> --num_speculations <num_tokens>.
  • Resources: Requires PyTorch and HuggingFace Transformers. Specific hardware requirements depend on the chosen LLM.
  • Docs: Hugging Face integration, PyTorch torchtune integration, Hugging Face trl integration.
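
For a quick try outside this repository, the Hugging Face Transformers integration linked under Docs exposes the same idea through the assistant_early_exit argument to generate(). A minimal sketch, assuming a recent transformers release that supports that argument and approved access to the gated checkpoint:

```python
# Minimal sketch of LayerSkip via the Hugging Face Transformers integration.
# Assumes a transformers version that supports the `assistant_early_exit`
# generation argument and that access to the gated checkpoint has been granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# Draft tokens exit at layer 4 and are verified by the full model
# (early-exit self-speculative decoding).
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```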

Highlighted Details

  • Integrated into Hugging Face Transformers and PyTorch torchtune.
  • Supports self-speculative decoding for accelerated inference.
  • Includes scripts for benchmarking, evaluation (via EleutherAI LM Eval Harness), and hyperparameter sweeping.
  • Offers correctness verification for generated tokens when sampling is disabled.

Maintenance & Community

The project is from Meta AI (facebookresearch). Contributions are welcome; the repository includes a dedicated contributing guide.

Licensing & Compatibility

Licensed under CC BY-NC. The license permits adaptation and redistribution but prohibits commercial use, including derivative works intended for commercial purposes.

Limitations & Caveats

The CC BY-NC license prohibits commercial use. Speedups are primarily observed with checkpoints trained using the LayerSkip recipe and only when self-speculative decoding is enabled; standard autoregressive decoding shows no speed benefit. Because the method accelerates multi-token generation rather than a single forward pass, classification-style tasks do not benefit.

Health Check
Last commit

3 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
1
Star History
34 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Explore Similar Projects

Consistency_LLM by hao-ai-lab

0%
397
Parallel decoder for efficient LLM inference
created 1 year ago
updated 8 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
685
HF Transformers accelerator for faster inference
created 1 year ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

0.1%
1k
Parallel decoding algorithm for faster LLM inference
created 1 year ago
updated 4 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding

0.2%
3k
Framework for accelerating LLM generation using multiple decoding heads
created 1 year ago
updated 1 year ago