TriForce by Infini-AI-Lab

Framework for lossless acceleration of long sequence generation

created 1 year ago
261 stars

Top 98.0% on sourcepulse

Project Summary

TriForce accelerates long sequence generation for large language models through hierarchical speculative decoding, offering a training-free approach to improve efficiency. It targets researchers and practitioners working with long-context models who need to reduce latency and computational cost.

How It Works

TriForce employs a hierarchical speculative decoding strategy: a small, fast "draft" model proposes candidate tokens, which are then verified by a larger, more capable "target" model. The hierarchy lets each level cheaply verify the one below it, so the expensive full-context target model runs far less often per generated token. The framework also supports offloading the KV cache to CPU memory, combined with tensor parallelism and CUDA graphs for optimized performance across different hardware configurations.
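As a rough illustration of the draft-and-verify loop at the heart of speculative decoding (this is a toy greedy sketch, not TriForce's actual API; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models):

```python
def speculative_generate(draft_next, target_next, prefix, num_tokens, gamma=4):
    """Toy greedy speculative decoding loop (illustrative only).

    draft_next / target_next map a token list to the next token id.
    The draft proposes up to `gamma` tokens; the target verifies them,
    keeping the longest matching prefix and substituting its own token
    at the first mismatch, so the output matches target-only decoding.
    """
    out = list(prefix)
    while len(out) - len(prefix) < num_tokens:
        # Draft phase: cheaply propose up to gamma candidate tokens.
        drafted, ctx = [], list(out)
        for _ in range(gamma):
            drafted.append(draft_next(ctx))
            ctx.append(drafted[-1])
        # Verify phase: the target checks each candidate in order.
        for t in drafted:
            if len(out) - len(prefix) >= num_tokens:
                break
            expected = target_next(out)
            out.append(expected)  # always emit the target's own token
            if expected != t:
                break             # reject the remaining draft tokens
    return out[len(prefix):]
```

Because every emitted token is the target's own choice, the output is identical to decoding with the target alone; the draft only determines how many target calls can be batched into one verification pass, which is where the "lossless" speedup comes from.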

Quick Start & Requirements

  • Install: conda create -n TriForce python=3.9, conda activate TriForce, pip install -r requirements.txt, pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.9, CUDA 12.1 (for torch==2.2.1+cu121), flash_attn==2.5.7, transformers==4.37.2. Supports long-context Llama models (e.g., Llama2-7B-128K, LWM-Text-128K).
  • Resources: Requires significant GPU memory for on-chip operations and benefits from high PCIe bandwidth for offloading.
  • Docs: Paper, Blog

Highlighted Details

  • Achieves 2.2x speedup on a single A100 for 128K context length.
  • Supports offloading KV cache to CPU with tensor parallelism and CUDA Graph optimization.
  • Demonstrates performance on 2x RTX 4090s for offloading scenarios.
  • Includes baseline implementations for performance comparison.

Maintenance & Community

  • Authors affiliated with Carnegie Mellon University and Meta AI (FAIR).
  • Paper published on arXiv.
  • Issue #7 provides environment setup guidance.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Compatibility issues noted with newer transformers versions; transformers==4.37.2 is required.
  • CUDA Graph support for tensor parallelism is limited to A100s, not RTX 4090s.
  • Offloading performance is sensitive to PCIe bandwidth.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
