SpikingBrain-7B by BICLab

Spiking brain-inspired LLMs utilize hybrid attention and sparsity

Created 2 weeks ago


646 stars

Top 51.5% on SourcePulse

View on GitHub
Project Summary

SpikingBrain-7B is a large language model inspired by brain mechanisms, integrating hybrid attention, MoE, and spike encoding. It targets researchers and developers seeking efficient LLM training and inference, offering significant speedups and sparsity for long sequences, with potential applications in neuromorphic computing.

How It Works

The architecture combines hybrid efficient attention, Mixture-of-Experts (MoE) modules, and a spike encoding scheme inspired by brain function. This design enables continual pre-training with less than 2% of the data while matching the performance of mainstream models. The framework has been adapted for non-NVIDIA (MetaX) clusters and achieves over 100x speedup in time-to-first-token (TTFT) for 4M-token sequences, along with more than 69% micro-level sparsity from spiking activations complemented by MoE sparsity, offering insights for neuromorphic chip design.
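To make the micro-level sparsity claim concrete, here is a minimal sketch, assuming spike encoding means quantizing activations into non-negative integer spike counts; spike_encode and its threshold are illustrative stand-ins, not the repository's actual operators.

```python
import torch

def spike_encode(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Hypothetical spike encoding: quantize activations into non-negative
    integer spike counts; values below the threshold produce no spikes."""
    return torch.floor(torch.relu(x) / threshold)

# Measure the micro-level sparsity such an encoding induces on random activations.
x = torch.randn(4, 4096)                       # a batch of hidden activations
spikes = spike_encode(x)
sparsity = (spikes == 0).float().mean().item()
print(f"zeroed activations: {sparsity:.1%}")   # roughly 84% for standard-normal inputs at threshold 1.0
```

Because most spike counts are zero, downstream matrix multiplies could in principle skip most entries; MoE routing adds a second, coarser level of sparsity by activating only a subset of experts per token.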

Quick Start & Requirements

  • Installation: Docker deployment is available for NVIDIA GPUs (docker.1ms.run/vllm/vllm-openai:v0.10.0). The vLLM plugin is installed by cloning the repository and running pip install . inside the vllm-hymeta directory. The HuggingFace and quantized versions can be loaded directly (see the loading sketch after this list).
  • Prerequisites: NVIDIA GPUs are recommended for vLLM inference. Key dependencies include flash_attn==2.7.3, flash-linear-attention==0.1, vllm==0.10.0, torch==2.7.1, and standard Python build tools.
  • Resources: Setup involves repository cloning and package installation. Resource requirements scale with model size and sequence length.
  • Links: Technical Report (English/Chinese), arXiv:2509.05276, ModelScope weights (pre-trained, chat, quantized).
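For the "loaded directly" path, a minimal HuggingFace loading sketch follows. The repository ID BICLab/SpikingBrain-7B is an assumed placeholder, and trust_remote_code=True is assumed to be required for the custom hybrid-attention/MoE modules; check the README for the published checkpoint names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed placeholder ID -- substitute the checkpoint path published in the README.
model_id = "BICLab/SpikingBrain-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",        # let the checkpoint decide precision
    device_map="auto",         # place weights across available GPUs
    trust_remote_code=True,    # assumed: custom hybrid-attention/MoE modules
)

inputs = tokenizer("Spiking neural networks are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```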

Highlighted Details

  • Achieves over 100x speedup in TTFT for 4M-token sequences.
  • Spiking activations provide over 69% sparsity at the micro-level, complemented by MoE sparsity.
  • Enables continual pre-training using less than 2% of the data.
  • Supports framework adaptations for non-NVIDIA (MetaX) clusters.
  • The vLLM-HyMeta plugin provides modular inference integration on NVIDIA GPUs (see the serving sketch after this list).
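A minimal sketch of serving the model through vLLM once the vllm-hymeta plugin is installed. The model path is an assumed placeholder, and the plugin is assumed to register the custom architecture with vLLM on import.

```python
from vllm import LLM, SamplingParams

# Assumed placeholder path; replace with the published SpikingBrain-7B checkpoint.
# Assumes the vllm-hymeta plugin (pip install . in its directory) is installed so
# vLLM can resolve the hybrid-attention architecture.
llm = LLM(model="BICLab/SpikingBrain-7B", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain spike encoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```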

Maintenance & Community

No specific details on contributors, sponsorships, or community channels (Discord/Slack) are provided in the README. A technical report and arXiv link are available for deeper insights.

Licensing & Compatibility

The README does not explicitly state the license type or compatibility restrictions.

Limitations & Caveats

The W8ASpike quantized version employs 'pseudo-spiking', a tensor-level approximation of spiking activations rather than true event-driven spiking. True spiking requires asynchronous hardware and event-driven operators, which are beyond this repository's scope. The performance benchmarks also mention baselines trained on limited Chinese data, which may affect comparisons on other datasets.
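To make the distinction concrete, here is an illustrative sketch, assuming pseudo-spiking means applying spike quantization to ordinary dense tensors: a GPU still processes every zero, whereas event-driven hardware would consume only the nonzero events. The values and encoding are hypothetical, not the W8ASpike implementation.

```python
import torch

# Pseudo-spiking keeps activations as ordinary dense tensors, so a GPU still
# multiplies through all the zeros (illustrative values, not W8ASpike output).
dense_spikes = torch.tensor([[0., 2., 0., 0.],
                             [1., 0., 0., 3.]])

# Event-driven hardware would instead consume a sparse stream of
# (position, spike-count) events and skip the zeros entirely.
events = [(tuple(i.tolist()), int(dense_spikes[tuple(i)]))
          for i in torch.nonzero(dense_spikes)]
print(events)   # [((0, 1), 2), ((1, 0), 1), ((1, 3), 3)]
```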

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 11
  • Star History: 658 stars in the last 15 days
