High-performance framework for running decoder-only LLMs with 4-bit quantization
fastLLaMa provides a high-performance Python interface to the llama.cpp C++ library for running decoder-only LLMs. It targets developers who need efficient LLM deployment, offering 4-bit quantization, custom logging, and session state management for scalable production workflows.
How It Works
This framework wraps a C++ backend (llama.cpp) for core LLM operations with a user-friendly Python API. It supports 4-bit quantization for a reduced memory footprint and faster inference. Key features include in-memory system prompts, customizable logging, session state saving and loading, and dynamic LoRA adapter attachment and detachment with caching for performance. A hypothetical usage sketch of the session and LoRA features follows.
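The sketch below illustrates how session state persistence and LoRA adapter swapping could look in practice. The class and method names (Model, save_state, load_state, attach_lora, detach_lora) and their parameters are assumptions inferred from the feature list above, not confirmed API; consult the repository examples for the authoritative interface.

from fastllama import Model  # assumed package and class name

# Assumed constructor arguments: path to a 4-bit quantized ggml model,
# thread count, and context window size.
model = Model(path="./models/7B/ggml-model-q4_0.bin", num_threads=8, n_ctx=512)

# Persist the current session state so a later run can resume the
# conversation without re-ingesting the prompt (assumed method names).
model.save_state("./states/chat_session.bin")
model.load_state("./states/chat_session.bin")

# Dynamically attach a LoRA adapter for a specialized task, then detach it.
# The framework is described as caching adapters to speed up repeated swaps.
model.attach_lora("./loras/alpaca-lora-7b.bin")
model.detach_lora()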
Quick Start & Requirements
pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main
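After installation, basic inference could look like the minimal sketch below. The Model class, ingest and generate methods, and their parameters are assumptions based on the feature description above and may differ from the actual interface; check the repository examples before relying on them.

from fastllama import Model  # assumed package and class name

# Path to a locally available 4-bit quantized ggml model (adjust as needed).
MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"

# Assumed constructor parameters: model path, thread count, context size.
model = Model(path=MODEL_PATH, num_threads=8, n_ctx=512)

# Feed the prompt, then stream generated tokens to stdout as they arrive.
model.ingest("Explain 4-bit quantization in one sentence.")
model.generate(
    num_tokens=64,
    temp=0.8,
    top_p=0.95,
    streaming_fn=lambda token: print(token, end="", flush=True),
)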
Highlighted Details
Maintenance & Community
Licensing & Compatibility
llama.cpp has a permissive MIT license, and compatibility for commercial use is generally good.
Limitations & Caveats