prompt-lookup-decoding by apoorvumang

Decoding method for faster LLM generation

Created 1 year ago
566 stars

Top 56.8% on SourcePulse

View on GitHub
Project Summary

This repository introduces Prompt Lookup Decoding (PLD), a method to accelerate autoregressive decoding in large language models by leveraging n-gram matching within the prompt. It targets users performing input-grounded generation tasks like summarization, QA, and chat, offering significant speedups (2x-4x) without compromising output quality.

How It Works

PLD modifies speculative decoding by replacing the draft model with a string-matching function. This function identifies repeated n-grams in the prompt and uses them to predict candidate token sequences. By matching the last few tokens of the generated sequence against earlier parts of the prompt, PLD generates candidate continuations, effectively skipping multiple token generation steps when matches are found. This approach is model-agnostic and works with both greedy and sampling decoding strategies.
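The core lookup is simple enough to sketch in a few lines. The snippet below is an illustrative re-implementation of the idea rather than the repository's exact code; the function name, signature, and defaults are assumptions chosen to mirror the description above (match up to a 3-gram from the end of the sequence against the prompt and, on a hit, copy up to 10 following tokens as draft candidates).

```python
import torch

def find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Illustrative prompt-lookup draft function (name and defaults are assumptions).

    input_ids: LongTensor of shape (1, seq_len) holding the prompt plus tokens
    generated so far. Returns up to num_pred_tokens candidate draft tokens, or
    an empty tensor when no n-gram match is found.
    """
    seq_len = input_ids.size(1)

    # Prefer longer n-grams; back off to shorter ones if no match is found.
    for ngram_size in range(max_ngram_size, 0, -1):
        tail = input_ids[0, -ngram_size:].tolist()

        # Scan earlier positions for an occurrence of the same n-gram,
        # skipping the trivial match with the tail itself.
        for start in range(seq_len - ngram_size - 1, -1, -1):
            if input_ids[0, start:start + ngram_size].tolist() == tail:
                cont_start = start + ngram_size
                cont_end = min(cont_start + num_pred_tokens, seq_len)
                if cont_end > cont_start:
                    # Tokens that followed the matched n-gram become the draft;
                    # the model verifies them in a single forward pass.
                    return input_ids[0, cont_start:cont_end]

    # No match: fall back to ordinary one-token-at-a-time decoding.
    return torch.empty(0, dtype=torch.long, device=input_ids.device)
```

As in standard speculative decoding, drafted tokens are only accepted when they agree with the model's own predictions, so a bad draft costs nothing beyond the wasted candidates; this is why output quality is unchanged.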

Quick Start & Requirements

  • Transformers Library: Add prompt_lookup_num_tokens=10 to your model.generate(...) call (see the example after this list).
  • vLLM: Set speculative_model="[ngram]".
  • Dependencies: PyTorch, Transformers.
  • Resources: Tested on a single A100 40GB GPU with Mistral-7B-Instruct-v0.1.
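For the Transformers path, a minimal end-to-end call looks like the sketch below. It assumes a recent Transformers release that supports the prompt_lookup_num_tokens generation argument; the model name mirrors the README's test setup, and the prompt text is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # model used in the README's benchmarks
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# PLD helps most when the output is grounded in the input, e.g. summarization or context QA.
prompt = "Summarize the following article:\n<article text here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens enables prompt lookup decoding and caps the draft length per step.
outputs = model.generate(**inputs, max_new_tokens=256, prompt_lookup_num_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```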

Highlighted Details

  • Achieves 2.4x average speedup on summarization and context-QA tasks.
  • Demonstrates measurable gains in multi-turn chat, particularly in coding tasks.
  • No model architecture changes or external datastores are required.
  • Works with both greedy and sampling decoding methods.

Maintenance & Community

The project is authored by Apoorv Saxena (apoorvumang). The README does not provide further community-engagement or roadmap details.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The current string-matching implementation may not be optimal, and strategies for handling multiple matches and choosing the ideal continuation length are still being explored. Gains in roleplay-style chat are noted as lower, since the output is less grounded in the prompt and fewer n-gram matches are found.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

Consistency_LLM by hao-ai-lab

0.3%
404
Parallel decoder for efficient LLM inference
Created 1 year ago
Updated 10 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Binyuan Hui (Research Scientist at Alibaba Qwen), and 3 more.

xgen by salesforce

0.1%
723
LLM research release with 8k sequence length
Created 2 years ago
Updated 7 months ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 5 more.

matmulfreellm by ridgerchu

0.0%
3k
MatMul-free language models
Created 1 year ago
Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago