HF Transformers accelerator for faster inference
GPTFast provides techniques to accelerate Hugging Face Transformers models, targeting researchers and engineers seeking faster inference. It generalizes optimizations originally developed for Llama-2-7b to all Hugging Face models, offering significant speedups through methods like static key-value caching and speculative decoding.
How It Works
GPTFast integrates its optimizations by modifying model forward passes and attention mechanisms. A cache_config dictionary specifies how to inject static key-value caches, referencing model components and their forward-pass logic. For speculative decoding, a smaller "draft" model generates candidate tokens that the main model then verifies in a single forward pass, so the expensive model runs once over several candidates instead of once per generated token.
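As a rough illustration of the verification step (a minimal sketch of the general technique, not GPTFast's actual implementation), the loop below runs greedy speculative decoding for batch-size-1 Hugging Face causal LMs; the function name and parameters are chosen here for exposition:

```python
import torch

@torch.no_grad()
def speculative_decode(target_model, draft_model, input_ids, k=4, max_new_tokens=64):
    # Greedy speculative decoding sketch (batch size 1): the draft model
    # proposes k tokens cheaply; the target model verifies them in one pass.
    tokens = input_ids
    start = input_ids.shape[1]
    while tokens.shape[1] - start < max_new_tokens:
        # 1. Draft model proposes k candidate tokens autoregressively.
        draft = tokens
        for _ in range(k):
            logits = draft_model(draft).logits[:, -1, :]
            draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)
        candidates = draft[:, tokens.shape[1]:]

        # 2. Target model scores the whole proposed sequence in a single pass.
        target_logits = target_model(draft).logits
        # Logits at position i predict token i + 1, so this slice lines up
        # with the k candidate positions.
        preds = target_logits[:, tokens.shape[1] - 1 : -1, :].argmax(-1)

        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = int((preds == candidates).long().cumprod(dim=1).sum())
        tokens = torch.cat([tokens, candidates[:, :n_accept]], dim=1)

        # 4. On the first mismatch, emit the target model's own token so the
        # loop always makes progress.
        if n_accept < k:
            tokens = torch.cat([tokens, preds[:, n_accept : n_accept + 1]], dim=1)
    return tokens
```

Because the target model scores all k candidates in one forward pass, each accepted prefix amortizes a single expensive forward over several output tokens.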
Quick Start & Requirements
pip install gptfast
The gpt_fast, load_int8, add_kv_cache, and add_speculative_decoding functions are documented in the README.
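The exact signatures live in the README; purely as a sketch, a call might look like the following, where the import path, positional argument, and draft_model_name parameter are all assumptions rather than GPTFast's confirmed API:

```python
from transformers import AutoTokenizer
from GPTFast.Core import gpt_fast  # import path assumed; verify against the README

# Hypothetical invocation: wrap a HF model with the library's optimizations,
# supplying a smaller draft model for speculative decoding.
model = gpt_fast(
    "gpt2-xl",                # main model to accelerate (assumed positional arg)
    draft_model_name="gpt2",  # draft model (assumed parameter name)
)

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
```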
Highlighted Details
Key optimizations include torch.compile, int8 quantization, and speculative decoding.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The provided documentation is marked as deprecated, with new documentation pending. The cache_config dictionary requires detailed knowledge of the target model's internal structure for customization.
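To give a sense of what that customization entails, the sketch below shows the kind of model-internal detail such a config has to spell out; every key here is invented for illustration and does not reflect GPTFast's actual schema:

```python
# Illustrative only: invented keys showing the model-internal detail a
# cache_config has to encode (not GPTFast's actual schema).
cache_config = {
    "layer_container": "transformer.h",      # attribute path to the decoder layers
    "attention_module": "attn",              # attention submodule inside each layer
    "kv_projections": ["k_proj", "v_proj"],  # projections whose outputs are cached
    "max_seq_length": 1024,                  # size of the preallocated static cache
    "dtype": "float16",                      # dtype of the cache tensors
}
```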