Research paper on efficient LLM inference via contextual sparsity
DejaVu addresses the high computational cost of Large Language Model (LLM) inference by introducing contextual sparsity, enabling significant speedups without compromising model quality or in-context learning. It targets researchers and engineers working with LLMs who need to reduce inference latency on modern hardware.
How It Works
DejaVu leverages the insight that only a small subset of attention heads and MLP parameters is necessary for accurate inference on a given input. It employs a low-cost, on-the-fly prediction algorithm to identify these "contextual" sparse components at each layer. This approach avoids costly retraining, preserves LLM capabilities such as in-context learning, and delivers wall-clock speedups.
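As a rough illustration of the idea (a minimal sketch, not the repository's actual implementation), the snippet below shows a single-token MLP block that uses a small linear predictor to pick the top-k intermediate neurons for the current input and computes only those; all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ContextualSparseMLP(nn.Module):
    """Toy MLP block that computes only a predicted subset of neurons (illustrative)."""

    def __init__(self, d_model=1024, d_ff=4096, k=512):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        # Low-cost predictor: scores intermediate neurons from the current input.
        self.predictor = nn.Linear(d_model, d_ff, bias=False)
        self.k = k

    def forward(self, x):
        # x: (d_model,) -- single-token decoding, where contextual sparsity pays off.
        idx = self.predictor(x).topk(self.k).indices                     # predicted active neurons
        h = torch.relu(x @ self.fc1.weight[idx].T + self.fc1.bias[idx])  # only k rows of fc1
        return h @ self.fc2.weight[:, idx].T + self.fc2.bias             # only k columns of fc2

block = ContextualSparseMLP()
out = block(torch.randn(1024))  # touches k of the 4096 intermediate neurons
```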
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is hosted under the FMInference organization, and the accompanying paper was published at ICML 2023. The README does not mention community channels (Discord/Slack) or other active-maintenance signals.
Licensing & Compatibility
The README does not explicitly state a license. The code is provided for research purposes. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The setup process is complex: it requires specific older versions of PyTorch and CUDA and involves multiple data collection and training steps. Accuracy benchmarks rely on a separate repository (Decentralized_FM_alpha). The latency benchmarking script for sparse MLP + sparse attention blocks requires careful configuration of the sparsity thresholds (mlp-K, att-K1, att-K2).
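For context on those thresholds, the hypothetical helper below shows one way to express them as fractions of the MLP width and attention head count; the names mirror the README's flags, but their exact semantics and the benchmark script's real interface should be taken from the repository.

```python
# Hypothetical helper (not the repository's CLI): pick sparsity thresholds as
# fractions of layer widths. Flag names mirror mlp-K / att-K1 / att-K2; their
# precise meaning should be confirmed against the DejaVu benchmark script.
def pick_thresholds(d_ff=4 * 12288, n_heads=96, mlp_keep=0.25, att_keep=0.5):
    return {
        "mlp-K": int(d_ff * mlp_keep),      # MLP neurons kept per token (assumed)
        "att-K1": int(n_heads * att_keep),  # attention selection threshold 1 (assumed)
        "att-K2": int(n_heads * att_keep),  # attention selection threshold 2 (assumed)
    }

print(pick_thresholds())  # {'mlp-K': 12288, 'att-K1': 48, 'att-K2': 48}
```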