High-performance framework for running decoder-only LLMs with 4-bit quantization
fastLLaMa provides a high-performance Python interface to the llama.cpp C++ library for running decoder-only LLMs. It targets developers who need efficient LLM deployment, offering 4-bit quantization, custom logging, and session state management for scalable production workflows.
How It Works
This framework wraps a C++ backend (llama.cpp) for core LLM operations with a user-friendly Python API. It supports 4-bit quantization for a reduced memory footprint and faster inference. Key features include in-memory system prompts, customizable logging, session state saving and loading, and dynamic LoRA adapter attachment and detachment with caching for performance. A hypothetical usage sketch of the session and LoRA features follows.
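The sketch below illustrates how session state persistence and LoRA adapter swapping could look in practice. The class and method names (Model, save_state, load_state, attach_lora, detach_lora) and their parameters are assumptions inferred from the feature list above, not confirmed API; consult the repository examples for the authoritative interface.

from fastllama import Model  # assumed package and class name

# Assumed constructor arguments: path to a 4-bit quantized ggml model,
# thread count, and context window size.
model = Model(path="./models/7B/ggml-model-q4_0.bin", num_threads=8, n_ctx=512)

# Persist the current session state so a later run can resume the
# conversation without re-ingesting the prompt (assumed method names).
model.save_state("./states/chat_session.bin")
model.load_state("./states/chat_session.bin")

# Dynamically attach a LoRA adapter for a specialized task, then detach it.
# The framework is described as caching adapters to speed up repeated swaps.
model.attach_lora("./loras/alpaca-lora-7b.bin")
model.detach_lora()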
Quick Start & Requirements
pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main
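After installation, basic inference could look like the minimal sketch below. The Model class, ingest and generate methods, and their parameters are assumptions based on the feature description above and may differ from the actual interface; check the repository examples before relying on them.

from fastllama import Model  # assumed package and class name

# Path to a locally available 4-bit quantized ggml model (adjust as needed).
MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"

# Assumed constructor parameters: model path, thread count, context size.
model = Model(path=MODEL_PATH, num_threads=8, n_ctx=512)

# Feed the prompt, then stream generated tokens to stdout as they arrive.
model.ingest("Explain 4-bit quantization in one sentence.")
model.generate(
    num_tokens=64,
    temp=0.8,
    top_p=0.95,
    streaming_fn=lambda token: print(token, end="", flush=True),
)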
Highlighted Details
Maintenance & Community
Licensing & Compatibility
llama.cpp has a permissive MIT license, and compatibility for commercial use is generally good.
Limitations & Caveats