prompt-lookup-decoding by apoorvumang

Decoding method for faster LLM generation

Created 1 year ago
566 stars

Top 56.8% on SourcePulse

View on GitHub
Project Summary

This repository introduces Prompt Lookup Decoding (PLD), a method to accelerate autoregressive decoding in large language models by leveraging n-gram matching within the prompt. It targets users performing input-grounded generation tasks like summarization, QA, and chat, offering significant speedups (2x-4x) without compromising output quality.

How It Works

PLD modifies speculative decoding by replacing the draft model with a string-matching function. This function identifies repeated n-grams in the prompt and uses them to predict candidate token sequences. By matching the last few tokens of the generated sequence against earlier parts of the prompt, PLD generates candidate continuations, effectively skipping multiple token generation steps when matches are found. This approach is model-agnostic and works with both greedy and sampling decoding strategies.
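The core lookup is simple enough to sketch in a few lines. The snippet below is an illustrative re-implementation of the idea rather than the repository's exact code; the function name, signature, and defaults are assumptions chosen to mirror the description above (match up to a 3-gram from the end of the sequence against the prompt and, on a hit, copy up to 10 following tokens as draft candidates).

```python
import torch

def find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Illustrative prompt-lookup draft function (name and defaults are assumptions).

    input_ids: LongTensor of shape (1, seq_len) holding the prompt plus tokens
    generated so far. Returns up to num_pred_tokens candidate draft tokens, or
    an empty tensor when no n-gram match is found.
    """
    seq_len = input_ids.size(1)

    # Prefer longer n-grams; back off to shorter ones if no match is found.
    for ngram_size in range(max_ngram_size, 0, -1):
        tail = input_ids[0, -ngram_size:].tolist()

        # Scan earlier positions for an occurrence of the same n-gram,
        # skipping the trivial match with the tail itself.
        for start in range(seq_len - ngram_size - 1, -1, -1):
            if input_ids[0, start:start + ngram_size].tolist() == tail:
                cont_start = start + ngram_size
                cont_end = min(cont_start + num_pred_tokens, seq_len)
                if cont_end > cont_start:
                    # Tokens that followed the matched n-gram become the draft;
                    # the model verifies them in a single forward pass.
                    return input_ids[0, cont_start:cont_end]

    # No match: fall back to ordinary one-token-at-a-time decoding.
    return torch.empty(0, dtype=torch.long, device=input_ids.device)
```

As in standard speculative decoding, drafted tokens are only accepted when they agree with the model's own predictions, so a bad draft costs nothing beyond the wasted candidates; this is why output quality is unchanged.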

Quick Start & Requirements

  • Transformers Library: Add prompt_lookup_num_tokens=10 to your model.generate(...) call (see the example after this list).
  • vLLM: Set speculative_model="[ngram]".
  • Dependencies: PyTorch, Transformers.
  • Resources: Tested on a single A100 40GB GPU with Mistral-7B-Instruct-v0.1.
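For the Transformers path, a minimal end-to-end call looks like the sketch below. It assumes a recent Transformers release that supports the prompt_lookup_num_tokens generation argument; the model name mirrors the README's test setup, and the prompt text is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # model used in the README's benchmarks
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# PLD helps most when the output is grounded in the input, e.g. summarization or context QA.
prompt = "Summarize the following article:\n<article text here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens enables prompt lookup decoding and caps the draft length per step.
outputs = model.generate(**inputs, max_new_tokens=256, prompt_lookup_num_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```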

Highlighted Details

  • Achieves 2.4x average speedup on summarization and context-QA tasks.
  • Demonstrates measurable gains in multi-turn chat, particularly in coding tasks.
  • No model architecture changes or external datastores are required.
  • Works with both greedy and sampling decoding methods.

Maintenance & Community

The project is authored by Apoorv Saxena (apoorvumang). The README does not provide further community-engagement or roadmap details.

Licensing & Compatibility

The repository does not explicitly state a license. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The current string-matching implementation may not be optimal, and strategies for handling multiple matches and choosing the ideal continuation length are still being explored. Gains in roleplay-style chat are noted as lower, since the output is less grounded in the prompt and fewer n-gram matches are found.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Cody Yu (Coauthor of vLLM; MTS at OpenAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

Consistency_LLM by hao-ai-lab

0.3%
404
Parallel decoder for efficient LLM inference
Created 1 year ago
Updated 10 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Binyuan Hui (Research Scientist at Alibaba Qwen), and 3 more.

xgen by salesforce

0.1%
723
LLM research release with 8k sequence length
Created 2 years ago
Updated 7 months ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 5 more.

matmulfreellm by ridgerchu

0.0%
3k
MatMul-free language models
Created 1 year ago
Updated 1 month ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago