HuggingFace Transformers implementation of BitNet scaling for LLMs
This repository provides a PyTorch implementation of BitNet, a 1-bit transformer architecture for large language models, integrated with the Hugging Face Transformers library. It targets researchers and engineers looking to explore significant memory savings and potential performance gains in LLMs by quantizing weights to 1-bit.
How It Works
The core innovation is the BitLinear layer, which replaces the standard linear layers in the Llama 2 architecture. This layer quantizes weights to 1-bit, drastically reducing memory footprint, while employing a mixed-precision approach for activations and other parameters to maintain performance. The implementation integrates directly into Hugging Face's Llama model by patching the modeling_llama.py file.
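For intuition, here is a minimal sketch of a weight-binarizing linear layer in this spirit. It is not the repository's actual BitLinear (which follows the BitNet paper and also covers activation handling); the class name and the exact binarization recipe are illustrative assumptions:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinearSketch(nn.Module):
    """Toy 1-bit linear layer illustrating the weight-binarization idea."""

    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Center the full-precision "shadow" weights, then binarize by sign.
        w = self.weight - self.weight.mean()
        beta = w.abs().mean()      # per-tensor scale that restores weight magnitude
        w_bin = torch.sign(w)      # values in {-1, +1} (0 only for exact zeros)
        # Straight-through estimator: the forward pass uses the binary weights,
        # while gradients flow back to the full-precision weights.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, beta * w_q, self.bias)


# Drop-in usage in place of a standard nn.Linear of the same shape.
layer = BitLinearSketch(16, 32)
y = layer(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 32])
```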
Quick Start & Requirements
Install the dependencies (pip install -r clm_requirements.txt), clone the Hugging Face Transformers repo, and install it in editable mode (pip install -e transformers). Then, replace the original Llama modeling file with the BitNet version.
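Once the patched modeling_llama.py is in place, loading a model goes through the ordinary Transformers API. The checkpoint path below is a placeholder, not something shipped by this repository:

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Placeholder path; substitute a Llama checkpoint trained or converted
# with the BitNet-patched modeling code.
checkpoint = "path/to/bitnet-llama-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LlamaForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16)

inputs = tokenizer("BitNet replaces linear layers with", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```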
Highlighted Details
Planned updates include storing 1-bit weights in uint8 and custom CUDA kernels for 1-bit weights.
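To illustrate why uint8 storage matters, here is one possible way to pack sign bits into bytes with PyTorch. This is an assumption about how such packing could look, not code from the repository:

```python
import torch


def pack_sign_bits(w: torch.Tensor) -> torch.Tensor:
    """Pack the signs of a weight tensor into a uint8 tensor (8 weights per byte)."""
    bits = (w >= 0).to(torch.uint8).flatten()  # negative -> 0, non-negative -> 1
    pad = (-bits.numel()) % 8                  # pad so the bit count is a multiple of 8
    if pad:
        bits = torch.cat([bits, bits.new_zeros(pad)])
    groups = bits.view(-1, 8)
    shifts = torch.arange(8, dtype=torch.uint8)
    return (groups << shifts).sum(dim=1).to(torch.uint8)


w = torch.randn(4096, 4096)
packed = pack_sign_bits(w)
# fp32 storage vs. packed 1-bit storage: roughly a 32x reduction.
print(w.element_size() * w.numel(), "bytes ->", packed.numel(), "bytes")
```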
Maintenance & Community
The project is maintained by Beomi. There are no explicit links to community channels or roadmaps provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility with commercial or closed-source projects is not specified.
Limitations & Caveats
The implementation is still under active development, with several planned updates including full 1-bit weight usage and custom CUDA kernels. The current version uses a mixed-precision approach for weights.