Kimi Linear (MoonshotAI): an efficient linear attention architecture that accelerates long-context LLMs
Kimi Linear is a hybrid linear attention architecture designed to overcome the performance and efficiency limitations of traditional full attention, particularly for long-context natural language processing tasks. It delivers superior quality and hardware efficiency, enabling models to process significantly longer sequences with reduced computational overhead. This benefits researchers and developers working with extensive documents, code, or complex conversational histories.
How It Works
The core innovation is Kimi Delta Attention (KDA), a refined linear attention mechanism based on the gated delta rule. KDA makes better use of the finite-state RNN memory through a more efficient gating mechanism. This is combined with a hybrid architecture that interleaves KDA and global MLA (Multi-Head Latent Attention) layers at a 3:1 ratio, which reduces KV cache requirements by up to 75% while matching or surpassing the quality of full attention models. The approach yields substantial gains in decoding throughput and reductions in time per output token (TPOT).
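To make the 3:1 interleaving concrete, here is a minimal sketch of how such a hybrid stack could be laid out. The "KDA" and "MLA" labels and the build_hybrid_layout helper are illustrative placeholders, not the repository's actual module names:

```python
# Illustrative sketch of the 3:1 KDA-to-MLA hybrid layout described above.
# "KDA" and "MLA" are placeholder labels, not the repo's actual classes.
def build_hybrid_layout(num_layers: int, kda_per_mla: int = 3) -> list[str]:
    """Return a layer pattern where every (kda_per_mla + 1)-th layer is global MLA."""
    period = kda_per_mla + 1
    return [
        "MLA" if (i + 1) % period == 0 else "KDA"  # one global layer per period
        for i in range(num_layers)
    ]

# Example: an 8-layer stack.
# Only the 2 MLA layers keep a growing KV cache; the KDA layers carry a
# constant-size recurrent state, so the cache shrinks to roughly 1/4 of a
# full-attention stack (the "up to 75%" reduction cited above).
print(build_hybrid_layout(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```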
Quick Start & Requirements
- Install the kernel library: pip install -U fla-core (requires fla-core >= 0.4.0).
- Inference requires Hugging Face Transformers: load models via AutoModelForCausalLM.from_pretrained with trust_remote_code=True.
- vLLM deployment supports very long contexts (e.g., --max-model-len 1048576).
- Pretrained checkpoints are published on Hugging Face (e.g., moonshotai/Kimi-Linear-48B-A3B-Instruct).

A minimal inference sketch follows this list.
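The sketch below assembles the quick-start notes into a runnable Transformers example. Only the model ID and the trust_remote_code requirement come from the notes above; the prompt, dtype, device placement, and generation arguments are illustrative assumptions:

```python
# Minimal inference sketch based on the quick-start notes; generation
# arguments are assumptions, not taken from the README.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # required: the model ships custom modeling code
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the gated delta rule."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For vLLM serving, the quick start similarly implies enabling trusted remote code alongside the long-context flag (--max-model-len 1048576).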
Maintenance & Community
The project is associated with a large author list (the "team2025kimi" citation), indicating significant research backing. The README provides no community channels (e.g., Discord, Slack) and no direct links to a roadmap.
Licensing & Compatibility
The README does not specify a software license. This omission requires clarification for adoption decisions, especially concerning commercial use or integration into proprietary systems.
Limitations & Caveats
The use of trust_remote_code=True for both Hugging Face inference and vLLM deployment necessitates careful security review. The 48B-parameter models impose substantial hardware requirements for inference and fine-tuning. The README does not detail benchmarks for shorter contexts or for NLP tasks beyond those highlighted.