bitsandbytes by bitsandbytes-foundation

PyTorch library for k-bit quantization, enabling accessible LLMs

Created 4 years ago · 7,714 stars · Top 6.7% on SourcePulse

Project Summary

bitsandbytes provides efficient k-bit quantization for large language models in PyTorch, enabling accessible deployment on consumer hardware. It targets researchers and developers working with LLMs who need to reduce memory footprint and improve inference speed.

How It Works

The library wraps custom CUDA kernels for 8-bit optimizers, LLM.int8() matrix multiplication, and 8-bit and 4-bit quantization. It exposes bitsandbytes.nn.Linear8bitLt and bitsandbytes.nn.Linear4bit as quantized linear layers and bitsandbytes.optim for 8-bit optimizers, reducing memory usage and potentially speeding up computation.
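
A minimal sketch of the layer-swap pattern, closely following the library's documented usage and assuming a CUDA GPU; the toy two-layer model is illustrative only:

    import torch
    import torch.nn as nn
    import bitsandbytes as bnb

    # Reference model in full precision and an int8 counterpart with matching shapes.
    fp16_model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
    int8_model = nn.Sequential(
        bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False),
        bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False),
    )

    # Copy the original weights; quantization happens when the model moves to the GPU.
    int8_model.load_state_dict(fp16_model.state_dict())
    int8_model = int8_model.to("cuda")

    out = int8_model(torch.randn(1, 64, dtype=torch.float16, device="cuda"))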

Quick Start & Requirements

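The package is published on PyPI and installed with pip install bitsandbytes. As a starting point, it requires a recent PyTorch build and, for the mainline release, an NVIDIA GPU with CUDA (see Limitations & Caveats below); consult the official documentation on Hugging Face for exact version requirements and for the status of other hardware backends.
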
Highlighted Details

  • Enables 8-bit and 4-bit quantization for LLMs.
  • Includes 8-bit optimizers (a short sketch follows this list).
  • Supports LLM.int8() matrix multiplication.
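
A minimal sketch of swapping in an 8-bit optimizer, assuming a CUDA GPU; the single linear layer and hyperparameters are illustrative only:

    import torch
    import bitsandbytes as bnb

    model = torch.nn.Linear(1024, 1024).cuda()

    # Drop-in replacement for torch.optim.Adam; optimizer state is kept in 8 bits,
    # which cuts optimizer memory substantially for large models.
    optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()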

Maintenance & Community

  • Ongoing efforts to support Intel CPUs and GPUs, AMD GPUs, Apple Silicon, and NPU backends.
  • Official documentation hosted on Hugging Face.

Licensing & Compatibility

  • MIT licensed.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

The library primarily targets NVIDIA GPUs with CUDA. Support for other hardware backends is under development and may not be production-ready.

Health Check

  • Last commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull requests (30d): 9
  • Issues (30d): 9
  • Star history: 97 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl

Top 1.6% on SourcePulse · 442 stars
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago · Updated 11 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

Top 0.2% on SourcePulse · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 3 months ago