DejaVu by FMInference

Research paper on efficient LLM inference via contextual sparsity

Created 2 years ago · 331 stars · Top 83.8% on sourcepulse

Project Summary

DejaVu addresses the high computational cost of Large Language Model (LLM) inference by introducing contextual sparsity, enabling significant speedups without compromising model quality or in-context learning. It targets researchers and engineers working with LLMs who need to reduce inference latency on modern hardware.

How It Works

DejaVu builds on the observation that, for a given input, only a small subset of attention heads and MLP parameters is needed for accurate inference. A low-cost, on-the-fly predictor identifies these "contextual" sparse components at each layer, so the rest of the layer's computation can be skipped. Because nothing is pruned permanently, the approach avoids costly retraining, preserves LLM capabilities, and still delivers wall-clock speedups.
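As a rough illustration, here is a minimal PyTorch sketch of contextual sparsity for one MLP block. The predictor architecture, names, and shapes are assumptions for exposition, not the repository's actual implementation:

    import torch
    import torch.nn as nn

    class MLPSparsityPredictor(nn.Module):
        # Cheap low-rank network that guesses which FFN neurons matter
        # for the current hidden state (hypothetical stand-in).
        def __init__(self, hidden_dim, ffn_dim, rank=64):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_dim, rank, bias=False),
                nn.Linear(rank, ffn_dim, bias=False),
            )

        def forward(self, x, k):
            scores = self.proj(x)                  # (batch, ffn_dim)
            return scores.topk(k, dim=-1).indices  # ids of "active" neurons

    def sparse_mlp(x, w_in, w_out, predictor, k):
        # x: (batch, hidden), w_in: (ffn, hidden), w_out: (hidden, ffn)
        idx = predictor(x, k)                      # (batch, k)
        w_in_act = w_in[idx]                       # gather only k input rows
        h = torch.relu(torch.einsum("bh,bkh->bk", x, w_in_act))
        w_out_act = w_out.t()[idx]                 # gather matching output columns
        return torch.einsum("bk,bkh->bh", h, w_out_act)

    # Example: touch only 256 of 4096 FFN neurons (~6% of the weights).
    x = torch.randn(4, 1024)
    w_in, w_out = torch.randn(4096, 1024), torch.randn(1024, 4096)
    pred = MLPSparsityPredictor(1024, 4096)
    y = sparse_mlp(x, w_in, w_out, pred, k=256)

Because the predictor is far cheaper than the dense FFN, gathering k rows and running two small contractions wins wall-clock time whenever k is small relative to the FFN width.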

Quick Start & Requirements

  • Installation: Requires PyTorch 1.12.0+cu113, cupy-cuda11x==11.0.0, and NCCL for CUDA 11.x; the Hugging Face Transformers library is also needed.
  • Prerequisites: CUDA 11.x is mandatory. Docker is recommended for latency benchmarking. (A quick environment sanity check is sketched after this list.)
  • Setup: Involves collecting training data via shell scripts, converting Hugging Face checkpoints, and training sparsity predictors (attention and MLP).
  • Resources: Requires significant disk space for the C4 dataset and model checkpoints (e.g., OPT-175B).
  • Links: Paper: https://proceedings.mlr.press/v202/liu23am.html
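A quick way to verify the pinned environment before running the scripts (a minimal sketch; the version strings come from the list above and may need adjusting for your setup):

    import torch
    import cupy
    import transformers

    assert torch.__version__.startswith("1.12.0"), torch.__version__
    assert torch.version.cuda and torch.version.cuda.startswith("11."), \
        "a CUDA 11.x build of PyTorch is required"
    assert cupy.__version__ == "11.0.0", cupy.__version__
    print("torch", torch.__version__, "| cuda", torch.version.cuda,
          "| cupy", cupy.__version__, "| transformers", transformers.__version__)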

Highlighted Details

  • Reduces OPT-175B generation latency by more than 2x versus FasterTransformer and more than 6x versus the Hugging Face implementation.
  • Maintains model quality and in-context learning abilities.
  • Includes benchmarks for end-to-end accuracy (perplexity, downstream tasks) and generation latency.
  • Utilizes CUDA graphs for optimized latency benchmarking (a generic measurement pattern is sketched below).
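The CUDA-graph timing pattern such benchmarks typically use looks like the following sketch; the function and argument names (bench_cuda_graph, fn, example_input) are illustrative, not the repository's script:

    import torch

    def bench_cuda_graph(fn, example_input, warmup=3, iters=100):
        # Warm up on a side stream so allocations settle before capture.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(warmup):
                fn(example_input)
        torch.cuda.current_stream().wait_stream(s)

        # Record the kernel sequence once, then replay it without
        # per-kernel launch overhead.
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            fn(example_input)

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            g.replay()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # mean latency in ms

Replaying a captured graph removes Python and kernel-launch overhead from the measurement, which matters when the sparse kernels themselves are short.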

Maintenance & Community

The project is associated with FMInference and has a paper published at ICML 2023. No specific community channels (Discord/Slack) or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The setup process is complex: it pins older versions of PyTorch and CUDA and involves multiple data-collection and predictor-training steps. Accuracy benchmarks rely on a separate repository (Decentralized_FM_alpha). The latency benchmarking script for sparse-MLP + sparse-attention blocks requires careful configuration of the sparsity thresholds (mlp-K, att-K1, att-K2); a back-of-the-envelope illustration of what these thresholds control follows.
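The K values below are illustrative assumptions, not the repository's defaults; only the OPT-175B layer shapes are the published ones:

    # OPT-175B per-layer shapes: hidden 12288, FFN 4*12288, 96 attention heads.
    hidden_dim, n_heads = 12288, 96
    ffn_dim = 4 * hidden_dim
    mlp_K, att_K1 = 6144, 24           # hypothetical top-k settings

    print(f"FFN neurons kept:     {mlp_K / ffn_dim:.1%}")   # 12.5%
    print(f"Attention heads kept: {att_K1 / n_heads:.1%}")  # 25.0%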

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
