minillm by kuleshov

Minimal system for running LLMs on consumer GPUs (research project)

Created 2 years ago
926 stars

Top 39.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

MiniLLM provides a minimal, Python-centric system for running large language models (LLMs) on consumer-grade NVIDIA GPUs. It targets researchers and power users seeking an accessible platform for experimentation with LLMs, focusing on efficient inference and alignment research.

How It Works

MiniLLM uses the GPTQ algorithm for model compression, enabling significant reductions in GPU memory usage. This allows models of up to 170B parameters to run on hardware typically found in consumer setups. The system supports multiple LLM architectures, including LLaMA, BLOOM, and OPT, with a codebase designed for simplicity and ease of use.
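
The memory savings follow directly from the arithmetic of weight storage: bytes ≈ parameter count × bits per weight / 8. A minimal sketch of that estimate (weights only; activations and the KV cache add further overhead, and the model sizes are illustrative):

    # Back-of-the-envelope weight memory: params * bits_per_weight / 8 bytes.
    # Weights only -- activations and KV cache are not counted.
    def weight_memory_gb(n_params: float, bits: float) -> float:
        return n_params * bits / 8 / 1e9

    for name, n in [("7B", 7e9), ("65B", 65e9), ("170B", 170e9)]:
        print(f"{name}: {weight_memory_gb(n, 16):.0f} GB at fp16 -> "
              f"{weight_memory_gb(n, 3):.1f} GB at 3-bit GPTQ")
    # 7B: 14 GB -> 2.6 GB; 65B: 130 GB -> 24.4 GB; 170B: 340 GB -> 63.8 GB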

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by python setup.py install. A conda environment is recommended (see the example session after this list).
  • Prerequisites: Python 3.8+, PyTorch (tested with 1.13.1+cu116), NVIDIA GPU (Pascal architecture or newer), CUDA toolkit.
  • Setup: Requires compiling a custom CUDA kernel.
  • Models: Download weights using minillm download --model <model_name> --weights <weights_path>.
  • Docs: https://github.com/kuleshov/minillm
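
Taken together, a first run looks roughly like the session below. The generate subcommand and its flags are assumptions for illustration, and <model_name> / <weights_path> are placeholders as in the bullets above; consult the repository README for the exact names.

    # create the recommended conda environment and install
    conda create -n minillm python=3.8
    conda activate minillm
    pip install -r requirements.txt
    python setup.py install   # compiles the custom CUDA kernel

    # fetch quantized weights, then sample from the model
    # (the "generate" subcommand and its flags are assumptions)
    minillm download --model <model_name> --weights <weights_path>
    minillm generate --model <model_name> --weights <weights_path> --prompt "..."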

Highlighted Details

  • Supports LLaMA, BLOOM, and OPT models up to 170B parameters.
  • Achieves significant memory reduction via 3-bit GPTQ compression.
  • Demonstrates chain-of-thought reasoning capabilities on consumer GPUs.
  • Offers both command-line and programmatic interfaces (sketched below).
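
For programmatic use, a call might look like the sketch below. The load_llm and generate names, and their parameters, are hypothetical stand-ins rather than the documented API; the repository README lists the real entry points.

    # Hypothetical sketch of a programmatic MiniLLM call.
    # Function and parameter names are assumptions -- see the repo README.
    import minillm  # assumed top-level package name

    llm, config = minillm.load_llm("<model_name>", weights="<weights_path>")
    text = minillm.generate(
        llm,
        config,
        prompt="What are the pyramids of Giza?",
        temperature=0.8,  # assumed sampling parameter
    )
    print(text)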

Maintenance & Community

This is a research project from Cornell Tech and Cornell University. Feedback can be sent to Volodymyr Kuleshov.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Only NVIDIA GPUs are currently supported. The project is experimental and a work in progress, with plans to add support for more LLMs, automated quantization, and fine-tuning. Generation quality can vary, so the best output may need to be selected manually from several samples.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Ross Wightman (Author of timm; CV at Hugging Face), Awni Hannun (Author of MLX; Research Scientist at Apple), and 1 more.

mlx-llm by riccardomusmeci

0% · 454 stars
LLM tools/apps for Apple Silicon using MLX
Created 1 year ago · Updated 7 months ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3% · 6k stars
LLM inference engine for blazing-fast performance
Created 1 year ago · Updated 1 day ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4% · 8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anil Dash (Former CEO of Glitch), and 23 more.

llamafile by Mozilla-Ocho

0.1% · 23k stars
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago · Updated 2 months ago