SpikingBrain-7B by BICLab

Spiking brain-inspired LLMs utilize hybrid attention and sparsity

Created 2 weeks ago


646 stars

Top 51.5% on SourcePulse

View on GitHub
Project Summary

SpikingBrain-7B is a large language model inspired by brain mechanisms, integrating hybrid attention, MoE, and spike encoding. It targets researchers and developers seeking efficient LLM training and inference, offering significant speedups and sparsity for long sequences, with potential applications in neuromorphic computing.

How It Works

The architecture combines hybrid efficient attention, Mixture-of-Experts (MoE) modules, and a spike encoding scheme inspired by brain function. This design enables continual pre-training with less than 2% of the data while matching the performance of mainstream models. The framework has been adapted for non-NVIDIA (MetaX) clusters and achieves over 100x speedup in time-to-first-token (TTFT) for 4M-token sequences, along with more than 69% micro-level sparsity from spiking activations complemented by MoE sparsity, offering insights for neuromorphic chip design.
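To make the micro-level sparsity claim concrete, here is a minimal sketch, assuming spike encoding means quantizing activations into non-negative integer spike counts; spike_encode and its threshold are illustrative stand-ins, not the repository's actual operators.

```python
import torch

def spike_encode(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Hypothetical spike encoding: quantize activations into non-negative
    integer spike counts; values below the threshold produce no spikes."""
    return torch.floor(torch.relu(x) / threshold)

# Measure the micro-level sparsity such an encoding induces on random activations.
x = torch.randn(4, 4096)                       # a batch of hidden activations
spikes = spike_encode(x)
sparsity = (spikes == 0).float().mean().item()
print(f"zeroed activations: {sparsity:.1%}")   # roughly 84% for standard-normal inputs at threshold 1.0
```

Because most spike counts are zero, downstream matrix multiplies could in principle skip most entries; MoE routing adds a second, coarser level of sparsity by activating only a subset of experts per token.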

Quick Start & Requirements

  • Installation: Docker deployment is available for NVIDIA GPUs (docker.1ms.run/vllm/vllm-openai:v0.10.0). The vLLM plugin is installed by cloning the repository and running pip install . inside the vllm-hymeta directory. The HuggingFace and quantized versions can be loaded directly (see the loading sketch after this list).
  • Prerequisites: NVIDIA GPUs are recommended for vLLM inference. Key dependencies include flash_attn==2.7.3, flash-linear-attention==0.1, vllm==0.10.0, torch==2.7.1, and standard Python build tools.
  • Resources: Setup involves repository cloning and package installation. Resource requirements scale with model size and sequence length.
  • Links: Technical Report (English/Chinese), arXiv:2509.05276, ModelScope weights (pre-trained, chat, quantized).
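For the "loaded directly" path, a minimal HuggingFace loading sketch follows. The repository ID BICLab/SpikingBrain-7B is an assumed placeholder, and trust_remote_code=True is assumed to be required for the custom hybrid-attention/MoE modules; check the README for the published checkpoint names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed placeholder ID -- substitute the checkpoint path published in the README.
model_id = "BICLab/SpikingBrain-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",        # let the checkpoint decide precision
    device_map="auto",         # place weights across available GPUs
    trust_remote_code=True,    # assumed: custom hybrid-attention/MoE modules
)

inputs = tokenizer("Spiking neural networks are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```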

Highlighted Details

  • Achieves over 100x speedup in TTFT for 4M-token sequences.
  • Spiking activations provide over 69% sparsity at the micro-level, complemented by MoE sparsity.
  • Enables continual pre-training using less than 2% of the data.
  • Supports framework adaptations for non-NVIDIA (MetaX) clusters.
  • The vLLM-HyMeta plugin provides modular inference integration on NVIDIA GPUs (see the serving sketch after this list).
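A minimal sketch of serving the model through vLLM once the vllm-hymeta plugin is installed. The model path is an assumed placeholder, and the plugin is assumed to register the custom architecture with vLLM on import.

```python
from vllm import LLM, SamplingParams

# Assumed placeholder path; replace with the published SpikingBrain-7B checkpoint.
# Assumes the vllm-hymeta plugin (pip install . in its directory) is installed so
# vLLM can resolve the hybrid-attention architecture.
llm = LLM(model="BICLab/SpikingBrain-7B", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain spike encoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```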

Maintenance & Community

No specific details on contributors, sponsorships, or community channels (Discord/Slack) are provided in the README. A technical report and arXiv link are available for deeper insights.

Licensing & Compatibility

The README does not explicitly state the license type or compatibility restrictions.

Limitations & Caveats

The W8ASpike quantized version employs 'pseudo-spiking', a tensor-level approximation of spiking activations rather than true event-driven spiking. True spiking requires asynchronous hardware and event-driven operators, which are beyond this repository's scope. The performance benchmarks also mention baselines trained on limited Chinese data, which may affect comparisons on other datasets.
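To make the distinction concrete, here is an illustrative sketch, assuming pseudo-spiking means applying spike quantization to ordinary dense tensors: a GPU still processes every zero, whereas event-driven hardware would consume only the nonzero events. The values and encoding are hypothetical, not the W8ASpike implementation.

```python
import torch

# Pseudo-spiking keeps activations as ordinary dense tensors, so a GPU still
# multiplies through all the zeros (illustrative values, not W8ASpike output).
dense_spikes = torch.tensor([[0., 2., 0., 0.],
                             [1., 0., 0., 3.]])

# Event-driven hardware would instead consume a sparse stream of
# (position, spike-count) events and skip the zeros entirely.
events = [(tuple(i.tolist()), int(dense_spikes[tuple(i)]))
          for i in torch.nonzero(dense_spikes)]
print(events)   # [((0, 1), 2), ((1, 0), 1), ((1, 3), 3)]
```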

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 11
  • Star History: 658 stars in the last 15 days
