Inference framework for 1-bit LLMs
This repository provides the official inference framework for 1-bit Large Language Models (LLMs), specifically BitNet b1.58. It enables fast and energy-efficient LLM inference on CPUs, with future support for NPUs and GPUs, targeting researchers and users who want to run LLMs locally on less powerful hardware.
How It Works
BitNet provides optimized C++ kernels built on the llama.cpp framework and the lookup-table (LUT) methodology from T-MAC. These specialized quantization kernels enable lossless inference of 1.58-bit (ternary) models, delivering significant speedups and energy reductions on CPUs.
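As a rough illustration of the lookup-table idea, here is a minimal NumPy sketch, not the actual bitnet.cpp C++ kernels: weights are quantized to ternary values {-1, 0, +1}, packed in groups, and multiply-accumulate is replaced by lookups into a small table of precomputed partial dot products. The group size `G` and the helper names (`encode_group`, `lut_matvec`) are illustrative assumptions.

```python
import numpy as np

G = 4  # ternary weights packed per lookup index (illustrative choice)

def quantize_ternary(w):
    """Round-to-nearest ternary quantization to {-1, 0, +1} with a
    per-tensor scale, in the spirit of BitNet b1.58's 1.58-bit weights."""
    scale = float(np.abs(w).mean()) + 1e-8
    return np.clip(np.round(w / scale), -1, 1).astype(np.int8), scale

def encode_group(wg):
    """Map a group of G ternary weights to a base-3 table index."""
    idx = 0
    for t in wg:              # each t is in {-1, 0, +1}
        idx = idx * 3 + (int(t) + 1)
    return idx

def lut_matvec(w_t, scale, x):
    """Mat-vec that consumes G weights per lookup: for each group of
    activations, precompute dot products against all 3**G ternary
    patterns once; every output row then does a single table lookup
    instead of G multiply-adds."""
    out_dim, in_dim = w_t.shape
    y = np.zeros(out_dim, dtype=np.float64)
    for g0 in range(0, in_dim, G):
        xg = x[g0:g0 + G]
        lut = np.empty(3 ** G, dtype=np.float64)
        for idx in range(3 ** G):
            acc, rem = 0.0, idx
            for t in reversed(range(G)):   # decode least-significant digit first
                acc += (rem % 3 - 1) * xg[t]
                rem //= 3
            lut[idx] = acc
        for o in range(out_dim):
            y[o] += lut[encode_group(w_t[o, g0:g0 + G])]
    return y * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
w_t, scale = quantize_ternary(w)
# The LUT path matches a plain ternary mat-vec exactly (up to float error).
print(np.allclose(lut_matvec(w_t, scale, x), (w_t * scale) @ x))
```

The real kernels use a bit-packed 2-bit encoding and SIMD table lookups rather than Python loops, but the accuracy argument is the same: the table holds exact partial sums, so quantized inference loses nothing beyond the ternary quantization itself.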
Quick Start & Requirements
Create a conda environment (`conda create -n bitnet-cpp python=3.9`, then `conda activate bitnet-cpp`), install dependencies (`pip install -r requirements.txt`), and build the project. Use `huggingface-cli download` to get models, then run `python setup_env.py` to prepare them for inference. Use `python run_inference.py -m <model_path> -p "Your prompt"` for text generation.
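For scripting, the CLI can be wrapped from Python. This is a hypothetical convenience wrapper that assumes only the `-m` and `-p` flags shown above; the model path in the example is illustrative.

```python
import subprocess

def generate(model_path: str, prompt: str) -> str:
    """Hypothetical wrapper around run_inference.py; uses only the
    -m and -p flags documented above."""
    result = subprocess.run(
        ["python", "run_inference.py", "-m", model_path, "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Illustrative path; point this at a model prepared by setup_env.py.
    print(generate("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
                   "Explain 1-bit LLMs in one sentence."))
```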
Highlighted Details
Maintenance & Community
This project is based on llama.cpp. Recent updates include official 2B-parameter BitNet models on Hugging Face and work on efficient edge inference for ternary LLMs.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
NPU and GPU support is listed as "coming next." The README notes that the tested models are dummy setups used in a research context, and that some model configurations (e.g., BitNet-b1.58-3B) may not support all quantization types (e.g., i2_s on x86).