lm.rs by samuel-vitorino

Minimal LLM inference in Rust

Created 1 year ago · 1,013 stars · Top 36.9% on SourcePulse

Project Summary

This project provides a minimal, CPU-only inference engine for large language models (LLMs), written in Rust. It targets developers and researchers who want to run LLMs locally without heavy ML dependencies, supporting Gemma 2, Llama 3.2, and PHI-3.5 (including multimodal capabilities), with quantized models for improved performance.

How It Works

The engine implements LLM inference directly in Rust, avoiding external ML libraries such as PyTorch or TensorFlow. Custom conversion scripts transform Hugging Face models into the project's own .lmrs format, with support for several quantization levels (e.g., Q8_0, Q4_0) that reduce the memory footprint and speed up inference. The core design prioritizes minimal dependencies and direct CPU execution, inspired by projects such as llama2.c.
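To make the quantization scheme concrete, here is a minimal sketch of group-wise int8 quantization in the Q8_0 style: each fixed-size group of weights stores one f32 scale plus the weights rounded to i8. The group size, type names, and function names are illustrative assumptions, not the project's actual .lmrs code.

```rust
/// One quantized group: GROUP_SIZE weights stored as i8 plus one f32 scale.
/// Hypothetical layout for illustration; the real .lmrs format may differ.
const GROUP_SIZE: usize = 32;

struct QuantGroup {
    scale: f32,
    quants: [i8; GROUP_SIZE],
}

/// Quantize f32 weights (length a multiple of GROUP_SIZE) into Q8_0-style
/// groups: scale = max(|w|) / 127, q = round(w / scale).
fn quantize_q8_0(weights: &[f32]) -> Vec<QuantGroup> {
    weights
        .chunks_exact(GROUP_SIZE)
        .map(|group| {
            let max_abs = group.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let mut quants = [0i8; GROUP_SIZE];
            for (q, &w) in quants.iter_mut().zip(group) {
                *q = (w / scale).round() as i8;
            }
            QuantGroup { scale, quants }
        })
        .collect()
}

/// Recover approximate f32 values: w ≈ q * scale.
fn dequantize(g: &QuantGroup) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (o, &q) in out.iter_mut().zip(&g.quants) {
        *o = q as f32 * g.scale;
    }
    out
}
```

At one f32 scale per 32 i8 values, the quantized tensor occupies roughly a quarter of the space of f32 weights while each group keeps its own dynamic range, which is where the smaller footprint and faster CPU inference come from.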

Quick Start & Requirements

  • Install Python dependencies: pip install -r requirements.txt
  • Convert models using python export.py and python tokenizer.py (a hypothetical reader sketch follows this list).
  • Compile Rust code: RUSTFLAGS="-C target-cpu=native" cargo build --release [--features multimodal]
  • Run inference: ./target/release/chat --model [model weights file]
  • WebUI backend: Compile with --features backend and run ./target/release/backend.
  • Requires Hugging Face model files (.safetensors, config.json, CLIP config for vision).
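For orientation, converters like these typically emit a single file beginning with a small fixed header (magic bytes, version, model dimensions) followed by raw tensor data. The reader below is a hypothetical, std-only sketch of that pattern; the actual .lmrs layout is defined by export.py, and every field shown here is an assumption.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical header for a single-file weights format.
/// The real .lmrs layout may differ entirely.
struct Header {
    magic: [u8; 4], // e.g. b"lmrs" (assumed, not the real magic)
    version: u32,
    n_layers: u32,
    hidden_dim: u32,
}

/// Read the assumed 16-byte little-endian header from the front of the file.
fn read_header(path: &str) -> io::Result<Header> {
    let mut f = File::open(path)?;
    let mut buf = [0u8; 16];
    f.read_exact(&mut buf)?;
    Ok(Header {
        magic: [buf[0], buf[1], buf[2], buf[3]],
        version: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
        n_layers: u32::from_le_bytes(buf[8..12].try_into().unwrap()),
        hidden_dim: u32::from_le_bytes(buf[12..16].try_into().unwrap()),
    })
}
```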

Highlighted Details

  • Supports Gemma 2, Llama 3.2, and PHI-3.5 (text and vision) models.
  • Achieves up to 50 tok/s for Llama 3.2 1B Q8_0 on a 16-core AMD EPYC.
  • Batch processing speeds up image encoding by up to 3x.
  • Offers int8 and int4 quantization for reduced model size (see the dot-product sketch after this list).
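The quantization support above maps to a simple pattern at inference time: dot products run directly over the i8 weights, applying the f32 scale once per group. The sketch below shows the generic technique, not lm.rs's actual kernel; real kernels usually quantize the activations as well so the products accumulate in integer registers.

```rust
const GROUP_SIZE: usize = 32;

/// Dot product of quantized weights against f32 activations, with one
/// f32 scale per GROUP_SIZE weights (a Q8_0-style layout).
/// Generic illustration; the project's actual kernel may differ.
fn dot_q8(weights: &[i8], scales: &[f32], x: &[f32]) -> f32 {
    assert_eq!(weights.len(), x.len());
    assert_eq!(weights.len(), scales.len() * GROUP_SIZE);
    let mut sum = 0.0f32;
    for (g, (w_group, x_group)) in weights
        .chunks_exact(GROUP_SIZE)
        .zip(x.chunks_exact(GROUP_SIZE))
        .enumerate()
    {
        // f32 accumulation here for clarity; integer accumulation plus a
        // per-group rescale is the usual optimized variant.
        let mut acc = 0.0f32;
        for (&w, &xv) in w_group.iter().zip(x_group) {
            acc += w as f32 * xv;
        }
        sum += acc * scales[g];
    }
    sum
}
```

Because CPU inference is typically limited by memory bandwidth, shrinking each weight from 4 bytes (f32) to 1 byte (int8) or half a byte (int4) cuts memory traffic roughly in proportion, which accounts for much of the practical speedup.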

Maintenance & Community

  • The project is maintained by a single author, who notes that parts of the code may not be fully optimized.
  • Links to a WebUI, Hugging Face collection, and demo videos are provided.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The author presents the project as an experimental learning exercise, so some code may still need optimization. Larger models (e.g., 27B) are reported to run too slowly for practical use on the author's hardware.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

Top 0.2% on SourcePulse · 2k stars
Python library for model compression (quantization, pruning, distillation, NAS)
Created 5 years ago · Updated 18 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Top 0.1% on SourcePulse · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

Top 0.2% on SourcePulse · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago