prompt-lookup-decoding by apoorvumang

Decoding method for faster LLM generation

created 1 year ago
556 stars

Top 58.5% on sourcepulse

Project Summary

This repository introduces Prompt Lookup Decoding (PLD), a method to accelerate autoregressive decoding in large language models by leveraging n-gram matching within the prompt. It targets users performing input-grounded generation tasks like summarization, QA, and chat, offering significant speedups (2x-4x) without compromising output quality.

How It Works

PLD modifies speculative decoding by replacing the draft model with a string-matching function. This function identifies repeated n-grams in the prompt and uses them to predict candidate token sequences. By matching the last few tokens of the generated sequence against earlier parts of the prompt, PLD generates candidate continuations, effectively skipping multiple token generation steps when matches are found. This approach is model-agnostic and works with both greedy and sampling decoding strategies.
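The matching step above can be sketched in a few lines. This is a simplified illustration, not the repository's actual implementation: the function name and defaults here are hypothetical, and the real code operates on tensors rather than Python lists.

```python
def find_candidate_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Look for the most recent earlier occurrence of the trailing n-gram of
    input_ids and return the tokens that followed it as draft candidates.

    Tries the longest n-gram first, falling back to shorter ones, and returns
    an empty list when no earlier match exists (PLD then decodes normally).
    """
    for ngram_size in range(max_ngram_size, 0, -1):
        ngram = input_ids[-ngram_size:]
        # Scan earlier positions right-to-left, excluding the trailing
        # occurrence itself, so the most recent match wins.
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                follow = input_ids[start + ngram_size:
                                   start + ngram_size + num_pred_tokens]
                if follow:
                    return follow
    return []


# The continuation of an earlier match is proposed as the draft:
print(find_candidate_tokens([1, 2, 3, 4, 5, 1, 2, 3]))  # → [4, 5, 1, 2, 3]
```

The returned candidates are then verified in a single forward pass, exactly as in speculative decoding; whenever the model agrees with a prefix of the draft, those tokens are accepted for free.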

Quick Start & Requirements

  • Transformers Library: Add prompt_lookup_num_tokens=10 to your model.generate(...) call.
  • vLLM: Set speculative_model="[ngram]".
  • Dependencies: PyTorch, Transformers.
  • Resources: Tested on a single A100 40GB GPU with Mistral-7B-Instruct-v0.1.
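Putting the Transformers one-liner from above into context, a usage sketch might look like the following. It assumes a transformers version recent enough to support `prompt_lookup_num_tokens` (added around v4.37); the model name comes from the README's test setup, and `long_document` is a placeholder for your own input-grounded prompt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # model used in the README's benchmarks
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

long_document = "..."  # placeholder: the text you want to summarize or ask about
prompt = long_document + "\n\nSummarize the text above."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens enables PLD: up to 10 draft tokens are copied
# from n-gram matches in the prompt and verified in one forward pass.
outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the draft comes from string matching rather than a second model, no extra weights are loaded and the speedup is largest when the output heavily reuses spans of the prompt (summaries, extractive QA).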

Highlighted Details

  • Achieves 2.4x average speedup on summarization and context-QA tasks.
  • Demonstrates measurable gains in multi-turn chat, particularly in coding tasks.
  • No model architecture changes or external datastores are required.
  • Works with both greedy and sampling decoding methods.

Maintenance & Community

The project is maintained by Apoorv Saxena. The README does not explicitly provide further community engagement or roadmap details.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The current string-matching implementation may not be optimal, and strategies for handling multiple matches or determining ideal continuation lengths are still under exploration. Performance gains in roleplay scenarios are noted as lower due to less predictable output.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (author of SGLang) and Jared Palmer (ex-VP of AI at Vercel; founder of Turborepo; author of Formik, TSDX).

xgen by salesforce

720 stars · LLM research release with 8k sequence length · created 2 years ago · updated 6 months ago