Kimi-Linear by MoonshotAI

Efficient linear attention architecture accelerates long-context LLMs

Created 6 days ago


947 stars

Top 38.7% on SourcePulse

View on GitHub
Project Summary

Kimi Linear is a hybrid linear attention architecture designed to overcome the performance and efficiency limitations of traditional full attention, particularly for long-context natural language processing. It combines strong model quality with high hardware efficiency, letting models process much longer sequences at lower computational cost, which benefits researchers and developers working with long documents, large codebases, or extended conversational histories.

How It Works

The core innovation is Kimi Delta Attention (KDA), a refined linear attention mechanism based on the gated delta rule. KDA makes more efficient use of finite-state RNN memory through a finer-grained gating mechanism. It is combined, in a hybrid architecture, with global MLA (Multi-head Latent Attention) layers at a 3:1 KDA-to-MLA ratio, which cuts KV cache requirements by up to 75% while matching or surpassing the quality of full attention models. The result is substantially higher decoding throughput and lower time per output token (TPOT).
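To make the recurrence concrete, below is a minimal, non-optimized PyTorch sketch of the gated delta rule that KDA refines. Everything here is illustrative: the function name, shapes, and per-channel decay gate are assumptions for exposition, and the repository relies on chunked, hardware-efficient kernels (via fla) rather than a per-token Python loop.

    import torch

    def gated_delta_rule(q, k, v, beta, alpha):
        """Naive recurrent reference for a gated delta rule (illustrative only).

        q, k: [T, d_k]; v: [T, d_v]; beta: [T] write strengths in (0, 1);
        alpha: [T, d_k] per-channel decay gates in (0, 1). A single scalar
        gate per step would recover a coarser Gated DeltaNet-style variant.
        """
        d_k, d_v = k.shape[-1], v.shape[-1]
        S = k.new_zeros(d_k, d_v)              # finite-size recurrent state ("fast weights")
        outputs = []
        for t in range(k.shape[0]):
            S = alpha[t].unsqueeze(-1) * S     # channel-wise forgetting (the gate)
            pred = k[t] @ S                    # value the state currently stores for k_t
            S = S + beta[t] * torch.outer(k[t], v[t] - pred)  # delta-rule correction
            outputs.append(q[t] @ S)           # read out with the query
        return torch.stack(outputs)            # [T, d_v]

The per-channel alpha gate is meant to echo KDA's fine-grained gating of the finite-state memory; in the hybrid model, roughly three such linear-attention layers are interleaved with every global MLA layer.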

Quick Start & Requirements

  • Installation: pip install -U fla-core
  • Prerequisites: Python >= 3.10, PyTorch >= 2.6, fla-core >= 0.4.0. Inference requires Hugging Face Transformers.
  • Usage: Models can be loaded via AutoModelForCausalLM.from_pretrained with trust_remote_code=True; a loading sketch follows this list.
  • Deployment: vLLM can serve the models behind an OpenAI-compatible API endpoint with context lengths up to 1M tokens (--max-model-len 1048576); a client sketch follows this list.
  • Links: Hugging Face model pages (e.g., moonshotai/Kimi-Linear-48B-A3B-Instruct).
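Based on the bullets above, a minimal inference sketch might look like the following. The model id comes from the README; the dtype, device placement, chat-template call, and generation settings are assumptions for the Instruct variant rather than repository-confirmed code.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

    # trust_remote_code=True pulls in the custom Kimi Linear modeling code.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )

    messages = [{"role": "user", "content": "Summarize the Kimi Linear architecture."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

Once a model is served with vLLM (using the --max-model-len flag noted above), the endpoint can be queried with any OpenAI-compatible client; the host, port, and placeholder API key below are illustrative, not taken from the README.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="moonshotai/Kimi-Linear-48B-A3B-Instruct",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)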

Highlighted Details

  • Scores 51.0 on MMLU-Pro (4k context) at speed comparable to full attention.
  • Scores 84.3 on RULER (128k context) with a 3.98x speedup, a Pareto-optimal accuracy/throughput trade-off.
  • Delivers up to 6.3x faster TPOT than MLA for sequences up to 1M tokens.
  • Cuts KV cache needs by up to 75% and boosts decoding throughput by up to 6x at 1M-token contexts.
  • Released models were trained on 5.7T tokens and include 48B-parameter versions with 3B activated parameters and 1M context length.

Maintenance & Community

The project is backed by a large author team (the citation key is "team2025kimi"), indicating significant research investment. The README provides no dedicated community channels (e.g., Discord, Slack) and no direct links to a roadmap.

Licensing & Compatibility

The README does not specify a software license. This omission requires clarification for adoption decisions, especially concerning commercial use or integration into proprietary systems.

Limitations & Caveats

Loading the models with trust_remote_code=True, required for both Hugging Face inference and vLLM deployment, warrants careful security review. The 48B-parameter models impose substantial hardware requirements for inference and fine-tuning. The README does not detail benchmarks for shorter contexts or for NLP tasks beyond those highlighted.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 983 stars in the last 6 days

