fastLLaMa by PotatoSpudowski

High-performance framework for running decoder-only LLMs with 4-bit quantization

Created 2 years ago
413 stars

Top 70.8% on SourcePulse

Project Summary

fastLLaMa provides a high-performance Python interface to the llama.cpp C++ library for running decoder-only LLMs. It targets developers needing efficient LLM deployment, offering features like 4-bit quantization, custom logging, and session state management for scalable production workflows.

How It Works

This framework leverages a C++ backend (llama.cpp) for core LLM operations, wrapped by a user-friendly Python API. It supports 4-bit quantization for reduced memory footprint and faster inference. Key features include in-memory system prompts, customizable logging, session state saving/loading, and dynamic LoRA adapter attachment/detachment with caching for performance.
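
A minimal usage sketch based on the Python examples shipped in the repository (exact parameter names may differ between versions, so treat them as assumptions to verify):

    from fastllama import Model

    # Load a 4-bit quantized GGML model (paths and values are illustrative).
    model = Model(
        path="./models/7B/ggml-model-q4_0.bin",  # quantized model file
        num_threads=8,                           # CPU threads for inference
        n_ctx=512,                               # context window size
    )

    # Feed the prompt into the model's context, then stream tokens to stdout.
    model.ingest("### Instruction: Explain 4-bit quantization briefly.\n### Response:")
    model.generate(
        num_tokens=100,
        top_p=0.95,
        temp=0.8,
        streaming_fn=lambda token: print(token, end="", flush=True),
    )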

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main
  • Requirements: CMake, GCC 11+, C++17, Python 3.x.
  • Model conversion scripts are provided for LLaMA and Alpaca models.
  • See examples/python/ for detailed usage.

Highlighted Details

  • Supports LLaMA, Alpaca, GPT4All, and other derived models.
  • Dynamic LoRA adapter switching and quantization with caching.
  • Session state management (save/load/reset); both features are sketched after this list.
  • Custom logger implementation.
  • WebSocket server and Web UI planned.
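
Session handling and LoRA switching might look like the sketch below; the method names (save_state, load_state, reset, attach_lora, detach_lora) follow the repository's examples but are assumptions to check against the installed version:

    from fastllama import Model

    model = Model(path="./models/7B/ggml-model-q4_0.bin", num_threads=8, n_ctx=512)

    # Persist the current context so a conversation can resume in a later run.
    model.ingest("User: Hello!\n")
    model.save_state("./sessions/chat.bin")

    # ...later: restore the session instead of re-ingesting the whole prompt.
    model.load_state("./sessions/chat.bin")

    # Swap a task-specific LoRA adapter at runtime; requantized weights are
    # cached so repeated attach/detach cycles stay cheap.
    model.attach_lora("./loras/alpaca-lora-7b.bin")
    model.generate(num_tokens=50, streaming_fn=lambda t: print(t, end="", flush=True))
    model.detach_lora()

    # Reset clears the in-memory session without reloading the model file.
    model.reset()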

Maintenance & Community

  • The roadmap includes GPU Int4 support and multi-language bindings, though the last commit was 2 years ago (see Health Check below).
  • Contributions are welcomed via PRs.

Licensing & Compatibility

  • Both fastLLaMa and the underlying llama.cpp appear to use the permissive MIT license, so compatibility with commercial use is generally good.

Limitations & Caveats

  • Currently experimental; NVIDIA GPU Int4 support and Windows and Android builds are still pending.
  • Models are fully loaded into RAM, so memory requirements scale with model size (e.g., a 7B model needs ~3.9 GB quantized; see the estimate below).
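
A back-of-the-envelope check on that figure, assuming a q4_0-style format (roughly 4 bits per weight plus per-block scale overhead):

    # Rough RAM estimate for a 4-bit quantized 7B model (illustrative math only).
    n_params = 7e9
    bits_per_weight = 4.5  # ~4 bits per weight plus quantization block overhead
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    print(f"~{weights_gb:.1f} GB for weights")  # ~3.9 GB; runtime context adds more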

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

JittorLLMs by Jittor

  • Low-resource LLM inference library
  • 0.0% · 2k stars
  • Created 2 years ago · Updated 6 months ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

torchchat by pytorch

  • PyTorch-native SDK for local LLM inference across diverse platforms
  • 0.1% · 4k stars
  • Created 1 year ago · Updated 1 week ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.

llamafile by Mozilla-Ocho

  • Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
  • 0.1% · 23k stars
  • Created 2 years ago · Updated 2 months ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Anil Dash (Former CEO of Glitch), and 23 more.