mlx-lm by ml-explore

Python package for LLM text generation and fine-tuning on Apple silicon

created 4 months ago
1,469 stars

Top 28.4% on sourcepulse

Project Summary

This package enables running and fine-tuning large language models (LLMs) on Apple silicon using the MLX framework. It targets developers and researchers who want to use Apple hardware for efficient LLM experimentation, offering integration with the Hugging Face Hub for model access along with built-in quantization.

How It Works

MLX LM is built on MLX, a GPU-accelerated array framework designed for Apple silicon. It provides a Python API and command-line tools for loading models, generating text, and fine-tuning. Key features include model quantization (e.g., 4-bit) to reduce memory footprint and speed up inference, efficient long-context handling via rotating KV caches and prompt caching, and distributed inference and fine-tuning via mx.distributed.
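
A minimal generation sketch using the load and generate entry points of the Python API; the model repo below is one of many mlx-community conversions on the Hugging Face Hub, chosen here for illustration:

    from mlx_lm import load, generate

    # Downloads the model from the Hugging Face Hub on first use.
    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

    # Apply the model's chat template to format the prompt.
    messages = [{"role": "user", "content": "Explain KV caching in one paragraph."}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    # verbose=True also prints generation statistics such as tokens per second.
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
    print(text)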

Quick Start & Requirements

  • Install via pip (pip install mlx-lm) or conda (conda install -c conda-forge mlx-lm).
  • Requires macOS 13.0 or later; macOS 15.0 or later adds optimized handling of large models (wired memory).
  • Official documentation and examples are available in the GitHub repository.

Highlighted Details

  • Seamless integration with Hugging Face Hub for thousands of LLMs.
  • Supports low-rank and full model fine-tuning, including quantized models.
  • Enables model quantization and uploading to the Hugging Face Hub (see the sketch after this list).
  • Handles long prompts and generations efficiently via rotating KV caches and prompt caching.
  • Supports distributed inference and fine-tuning.
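
A sketch of the quantize-and-upload flow via the convert entry point; both repo names are illustrative, and uploading requires a Hugging Face account with write access to the destination repo:

    from mlx_lm import convert

    # Source model on the Hugging Face Hub (illustrative).
    repo = "mistralai/Mistral-7B-Instruct-v0.3"

    # Destination repo for the quantized weights (illustrative).
    upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

    # quantize=True applies 4-bit quantization by default; the result
    # is written locally and pushed to upload_repo.
    convert(repo, quantize=True, upload_repo=upload_repo)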

Maintenance & Community

The project is maintained by the ml-explore organization. Links to community resources such as Discord or Slack are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Performance with very large models may degrade if they exceed available RAM, though macOS 15+ adds optimizations (wired memory) for this case. Some models (e.g., Qwen, plamo) require trust_remote_code=True and may need an explicit eos_token; since trust_remote_code executes code shipped with the model repository, it should only be enabled for repositories you trust.
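
For such models, tokenizer options can be passed through load; a minimal sketch, assuming the Qwen/Qwen-7B repo and its <|endoftext|> end-of-sequence token as an example:

    from mlx_lm import load

    # trust_remote_code runs code shipped with the model repository,
    # so enable it only for sources you trust.
    model, tokenizer = load(
        "Qwen/Qwen-7B",
        tokenizer_config={"trust_remote_code": True, "eos_token": "<|endoftext|>"},
    )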

Health Check

  • Last commit: 18 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 48
  • Issues (30d): 30
  • Star History: 922 stars in the last 90 days
