Framework for lossless acceleration of long sequence generation
TriForce accelerates long sequence generation for large language models through hierarchical speculative decoding, offering a training-free approach to improve efficiency. It targets researchers and practitioners working with long-context models who need to reduce latency and computational cost.
How It Works
TriForce employs a hierarchical speculative decoding strategy. A small, fast draft model generates multiple candidate tokens, which are then verified by the larger, more capable target model. The hierarchy adds an intermediate stage: the target model operating on a partial, retrieval-based KV cache verifies the lightweight drafts and in turn acts as the draft for the full target model, so both the model-weight and the KV-cache bottlenecks of long-sequence generation are addressed. The framework also supports offloading the KV cache to CPU memory, with tensor parallelism and CUDA graphs used to optimize performance across different hardware configurations.
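To make the control flow concrete, below is a minimal, self-contained sketch of two-level speculative decoding. It is an illustration, not the TriForce implementation: toy_model, draft_tokens, verify, and hierarchical_step are invented names, the models are deterministic dummies rather than LLMs, and verification is simplified to greedy prefix matching instead of a sampling-based acceptance rule.

```python
"""Toy sketch of hierarchical (two-level) speculative decoding.

Illustration only: deterministic dummy models stand in for LLMs, and
verification is simplified to greedy prefix matching.
"""
import random

VOCAB_SIZE = 100

def toy_model(seed):
    """Return a deterministic next-token function standing in for an LLM."""
    def next_token(context):
        rng = random.Random(hash((seed, tuple(context))))
        return rng.randrange(VOCAB_SIZE)
    return next_token

draft  = toy_model(seed=1)  # tier 1: small, fast draft model
mid    = toy_model(seed=2)  # tier 2: e.g., target model on a partial KV cache
target = toy_model(seed=2)  # tier 3: full target model (agrees with tier 2 here)

def draft_tokens(model, context, k):
    """Autoregressively draft k candidate tokens with the cheaper model."""
    ctx, out = list(context), []
    for _ in range(k):
        t = model(ctx)
        out.append(t)
        ctx.append(t)
    return out

def verify(drafted, verifier, context):
    """Greedy verification: keep the longest prefix the verifier agrees with,
    then emit one corrected (or bonus) token from the verifier itself."""
    accepted, ctx = [], list(context)
    for t in drafted:
        v = verifier(ctx)
        if v != t:                  # first disagreement: take the correction, stop
            accepted.append(v)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(verifier(ctx))  # all drafts accepted: free bonus token
    return accepted

def hierarchical_step(context, k_inner=4, rounds=2):
    """Tier 1 speculates for tier 2; the chunk tier 2 accepts is then
    verified in one shot by the full target model (tier 3)."""
    chunk = []
    for _ in range(rounds):
        drafted = draft_tokens(draft, list(context) + chunk, k_inner)
        chunk += verify(drafted, mid, list(context) + chunk)
    return verify(chunk, target, context)

seq = [0]  # start from a dummy BOS token
for _ in range(5):
    seq += hierarchical_step(seq)
print(seq)
```

In this toy setup the middle and top tiers share a seed, so the outer verification accepts every chunk; TriForce's high outer acceptance rate analogously comes from the partial KV cache closely approximating the full one.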
Quick Start & Requirements
```bash
conda create -n TriForce python=3.9
conda activate TriForce
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Tested with torch==2.2.1+cu121, flash_attn==2.5.7, and transformers==4.37.2. Supports long-context Llama models (e.g., Llama2-7B-128K, LWM-Text-128K).

Highlighted Details
Maintenance & Community
Last updated 11 months ago; the project is marked Inactive.
Licensing & Compatibility
Limitations & Caveats
Compatibility is sensitive to transformers versions; transformers==4.37.2 is required.
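Given the tight version pins, a quick sanity check after installation can save debugging time. The snippet below is a generic sketch, not part of the TriForce repository; it simply compares the installed versions against the pins listed above.

```python
# Generic post-install sanity check; the expected pins come from the
# requirements listed above. Not part of the TriForce codebase.
import torch
import transformers
import flash_attn

expected = {
    "torch": "2.2.1+cu121",
    "transformers": "4.37.2",
    "flash_attn": "2.5.7",
}
found = {
    "torch": torch.__version__,
    "transformers": transformers.__version__,
    "flash_attn": flash_attn.__version__,
}

for name, want in expected.items():
    status = "OK" if found[name] == want else "MISMATCH"
    print(f"{name}: expected {want}, found {found[name]} [{status}]")

# A CUDA-enabled torch build is required; confirm a GPU is visible.
print("CUDA available:", torch.cuda.is_available())
```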