mlx-bitnet by exo-explore

Efficient LLM inference on Apple Silicon

Created 2 years ago
266 stars

Top 96.1% on SourcePulse

Project Summary

This repository provides an implementation of the 1.58-bit BitNet Large Language Model architecture, optimized for Apple Silicon using the MLX framework. It targets researchers and developers who need highly efficient LLMs, offering significant improvements in inference speed and memory usage over full-precision models such as Llama while reporting competitive or better perplexity.

How It Works

The implementation leverages the BitNet architecture, which replaces the standard floating-point weights in linear layers with low-bit representations. Specifically, it uses ternary weights (-1, 0, 1), which require roughly 1.58 bits per weight ($\log_2(3) \approx 1.58$). Combined with the MLX array framework, which is designed for Apple Silicon, this yields substantial reductions in compute cost and memory footprint. The design aims for a Pareto improvement: faster, smaller inference without a loss in model quality.
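The ternary scheme described above can be sketched in plain NumPy using absmean quantization, the approach introduced with BitNet b1.58: scale each weight tensor by the mean of its absolute values, then round and clip to {-1, 0, 1}. This is an illustrative sketch, not the repository's actual MLX implementation; the function names are hypothetical.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight tensor to ternary values {-1, 0, 1}
    using per-tensor absmean scaling (BitNet b1.58 style)."""
    scale = np.abs(w).mean() + eps            # absmean scale factor
    w_q = np.clip(np.round(w / scale), -1, 1)  # round, then clip to ternary
    return w_q, scale

def ternary_matmul(x: np.ndarray, w_q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate x @ w using the ternary weights and stored scale."""
    return (x @ w_q) * scale

w = np.random.randn(64, 64)
w_q, scale = ternary_quantize(w)
```

Because each weight takes one of only three values, the information content per weight is $\log_2(3) \approx 1.585$ bits, which is where the "1.58-bit" name comes from; the dense ternary matmul above can in principle be reduced to additions and subtractions, since multiplying by -1, 0, or 1 never requires a true multiply.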

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python environment, MLX, and model weights downloaded from Hugging Face (e.g., from 1bitLLM). The weight conversion script (convert.py) is provided.
  • Setup: Requires downloading large model weight files and running a conversion script, which may take considerable time.
  • Verification: Run interoperability tests using python test_interop.py. Long-running tests are skipped by default and require manual removal of @unittest.skip decorators within the test file.
  • Resources: Primarily targets Apple Silicon hardware due to the MLX dependency.
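The note above about `@unittest.skip` refers to Python's standard test-skipping mechanism: decorated tests are collected but reported as skipped rather than run. The snippet below illustrates the pattern; the class and method names are illustrative, not copied from the repository's `test_interop.py`.

```python
import unittest

class InteropTests(unittest.TestCase):
    def test_fast_check(self):
        # Cheap tests like this always run.
        self.assertEqual(1 + 1, 2)

    @unittest.skip("long-running; remove this decorator to enable")
    def test_full_model_parity(self):
        # Placeholder for an expensive end-to-end comparison.
        ...

suite = unittest.defaultTestLoader.loadTestsFromTestCase(InteropTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Deleting the `@unittest.skip(...)` line is all that is needed to re-enable a skipped test on the next run.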

Highlighted Details

  • Achieves 1.58-bit LLM inference on Apple Silicon.
  • Reports lower (better) perplexity than comparable Llama models.
  • Offers faster inference speeds and reduced memory requirements compared to Llama.
  • Based on Microsoft Research's advancements in low-bit LLM architectures.

Maintenance & Community

The project acknowledges contributions from 1bitLLM on Hugging Face, Nous Research, Awni Hannun, and the MLX contributors. No community channels (such as Discord or Slack) are mentioned, and no roadmap is detailed beyond the "In Progress" and "Not Started" items noted in the README.

Licensing & Compatibility

The specific open-source license is not stated in the provided README text. Compatibility is focused on Apple Silicon platforms due to the reliance on the MLX framework.

Limitations & Caveats

The project is actively under development, with features like optimized kernels, Python training, and Swift inference for mobile platforms listed as "In Progress." Core functionalities like demo apps and efficient storage formats are "Not Started." Running extended tests requires manual configuration. The setup process involves downloading and converting large model files.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 18 stars in the last 30 days
