Inference framework for 1-bit LLMs
This repository provides the official inference framework for 1-bit Large Language Models (LLMs), specifically BitNet b1.58. It enables fast and energy-efficient LLM inference on CPUs, with future support for NPUs and GPUs, targeting researchers and users who want to run LLMs locally on less powerful hardware.
How It Works
BitNet provides optimized C++ kernels built on the llama.cpp framework and the lookup-table (LUT) methodology from T-MAC. These specialized quantization kernels enable lossless inference of 1.58-bit (ternary) models, delivering significant speedups and energy reductions on CPUs.
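As a rough illustration of the lookup-table idea, here is a minimal NumPy sketch, not the actual bitnet.cpp C++ kernels: weights are quantized to ternary values {-1, 0, +1}, packed in groups, and multiply-accumulate is replaced by lookups into a small table of precomputed partial dot products. The group size `G` and the helper names (`encode_group`, `lut_matvec`) are illustrative assumptions.

```python
import numpy as np

G = 4  # ternary weights packed per lookup index (illustrative choice)

def quantize_ternary(w):
    """Round-to-nearest ternary quantization to {-1, 0, +1} with a
    per-tensor scale, in the spirit of BitNet b1.58's 1.58-bit weights."""
    scale = float(np.abs(w).mean()) + 1e-8
    return np.clip(np.round(w / scale), -1, 1).astype(np.int8), scale

def encode_group(wg):
    """Map a group of G ternary weights to a base-3 table index."""
    idx = 0
    for t in wg:              # each t is in {-1, 0, +1}
        idx = idx * 3 + (int(t) + 1)
    return idx

def lut_matvec(w_t, scale, x):
    """Mat-vec that consumes G weights per lookup: for each group of
    activations, precompute dot products against all 3**G ternary
    patterns once; every output row then does a single table lookup
    instead of G multiply-adds."""
    out_dim, in_dim = w_t.shape
    y = np.zeros(out_dim, dtype=np.float64)
    for g0 in range(0, in_dim, G):
        xg = x[g0:g0 + G]
        lut = np.empty(3 ** G, dtype=np.float64)
        for idx in range(3 ** G):
            acc, rem = 0.0, idx
            for t in reversed(range(G)):   # decode least-significant digit first
                acc += (rem % 3 - 1) * xg[t]
                rem //= 3
            lut[idx] = acc
        for o in range(out_dim):
            y[o] += lut[encode_group(w_t[o, g0:g0 + G])]
    return y * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
w_t, scale = quantize_ternary(w)
# The LUT path matches a plain ternary mat-vec exactly (up to float error).
print(np.allclose(lut_matvec(w_t, scale, x), (w_t * scale) @ x))
```

The real kernels use a bit-packed 2-bit encoding and SIMD table lookups rather than Python loops, but the accuracy argument is the same: the table holds exact partial sums, so quantized inference loses nothing beyond the ternary quantization itself.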
Quick Start & Requirements
Create a conda environment (`conda create -n bitnet-cpp python=3.9`, then `conda activate bitnet-cpp`), install dependencies (`pip install -r requirements.txt`), and build the project. Use `huggingface-cli download` to get models, then run `python setup_env.py` to prepare them for inference. Use `python run_inference.py -m <model_path> -p "Your prompt"` for text generation.
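For scripting, the CLI can be wrapped from Python. This is a hypothetical convenience wrapper that assumes only the `-m` and `-p` flags shown above; the model path in the example is illustrative.

```python
import subprocess

def generate(model_path: str, prompt: str) -> str:
    """Hypothetical wrapper around run_inference.py; uses only the
    -m and -p flags documented above."""
    result = subprocess.run(
        ["python", "run_inference.py", "-m", model_path, "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Illustrative path; point this at a model prepared by setup_env.py.
    print(generate("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
                   "Explain 1-bit LLMs in one sentence."))
```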
Highlighted Details
Maintenance & Community
This project is based on llama.cpp. Recent updates include official 2B-parameter BitNet models on Hugging Face and work on efficient edge inference for ternary LLMs.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
NPU and GPU support is listed as "coming next." The README notes that the tested models are dummy setups used in a research context, and that some model configurations (e.g., BitNet-b1.58-3B) may not support all quantization types (e.g., i2_s on x86).