Decoding method for faster LLM generation
This repository introduces Prompt Lookup Decoding (PLD), a method to accelerate autoregressive decoding in large language models by leveraging n-gram matching within the prompt. It targets users performing input-grounded generation tasks like summarization, QA, and chat, offering significant speedups (2x-4x) without compromising output quality.
How It Works
PLD modifies speculative decoding by replacing the draft model with simple string matching over the prompt. The last few generated tokens are matched against earlier parts of the input, and the tokens that followed a match are proposed as candidate continuations; the main model then verifies these candidates in a single forward pass, accepting several tokens at once when the prediction holds. This approach is model-agnostic and works with both greedy and sampling decoding strategies.
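A minimal sketch of the matching step, assuming an illustrative helper named find_candidate_tokens (not the repository's exact function or parameter names) that returns draft tokens copied from an earlier n-gram match, or None when there is no match:

```python
import torch

def find_candidate_tokens(input_ids, ngram_size=3, num_pred_tokens=10):
    """Return up to num_pred_tokens draft tokens by matching the last
    ngram_size tokens against earlier positions in the sequence."""
    seq = input_ids[0].tolist()
    if len(seq) <= ngram_size:
        return None
    ngram = seq[-ngram_size:]
    # Scan earlier positions for the same n-gram, most recent match first.
    for start in range(len(seq) - ngram_size - 1, -1, -1):
        if seq[start:start + ngram_size] == ngram:
            # The tokens that followed the match become the draft candidates;
            # the main model verifies them in one forward pass, as in
            # standard speculative decoding.
            candidates = seq[start + ngram_size:start + ngram_size + num_pred_tokens]
            return torch.tensor([candidates], device=input_ids.device)
    return None  # No match: decode the next token normally.
```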
Quick Start & Requirements
Add prompt_lookup_num_tokens=10 to your model.generate(...) call in the Hugging Face transformers library. In vLLM, set speculative_model="[ngram]" to enable the equivalent n-gram speculation.
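A short usage sketch of the transformers path; the model name and prompt are placeholders, and prompt_lookup_num_tokens is the only PLD-specific argument:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder; any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Input-grounded task: the answer reuses spans from the prompt,
# which is where prompt lookup decoding pays off.
document = "..."  # long article, code file, or chat history
prompt = f"{document}\n\nSummarize the text above in three sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens enables prompt lookup decoding in generate().
outputs = model.generate(**inputs, max_new_tokens=256, prompt_lookup_num_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```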
Highlighted Details
Maintenance & Community
The project is associated with Apoorv Saxena. Further community engagement or roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source integration.
Limitations & Caveats
The current string-matching implementation may not be optimal, and strategies for handling multiple matches and choosing the candidate continuation length are still under exploration. Speedups are smaller in roleplay and other open-ended scenarios, where the output overlaps less with the prompt.