fastLLaMa by PotatoSpudowski

High-performance framework for running decoder-only LLMs with 4-bit quantization

created 2 years ago
410 stars

Top 72.3% on sourcepulse

Project Summary

fastLLaMa provides a high-performance Python interface to the llama.cpp C++ library for running decoder-only LLMs. It targets developers needing efficient LLM deployment, offering features like 4-bit quantization, custom logging, and session state management for scalable production workflows.

How It Works

This framework leverages a C++ backend (llama.cpp) for core LLM operations, wrapped by a user-friendly Python API. It supports 4-bit quantization for reduced memory footprint and faster inference. Key features include in-memory system prompts, customizable logging, session state saving/loading, and dynamic LoRA adapter attachment/detachment with caching for performance.
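
A minimal usage sketch follows. It mirrors the pattern in the project's bundled examples, but the constructor arguments and method names (Model, ingest, generate, streaming_fn) are stated here as assumptions and should be checked against examples/python/ for the version you install.

    # Illustrative sketch only; verify the exact API against examples/python/.
    from fastllama import Model

    MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"  # hypothetical path to a 4-bit model

    def stream_token(token: str) -> None:
        # Called once per generated token; print without buffering.
        print(token, end="", flush=True)

    model = Model(
        path=MODEL_PATH,  # 4-bit quantized weights produced by the conversion scripts
        num_threads=8,    # CPU threads used by the llama.cpp backend
        n_ctx=512,        # context window size
    )

    # Ingest the prompt into the context, then stream the completion.
    model.ingest("### Instruction:\nName the planets in our solar system.\n### Response:\n")
    model.generate(
        num_tokens=100,
        temp=0.8,                   # sampling temperature
        streaming_fn=stream_token,  # receives tokens as they are decoded
    )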

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main
  • Requirements: CMake, a C++17-capable compiler (GCC 11+), and Python 3.x.
  • Model conversion scripts are provided for LLaMA and Alpaca models.
  • See examples/python/ for detailed usage.

Highlighted Details

  • Supports LLaMA, Alpaca, GPT4All, and other derived models.
  • Dynamic LoRA adapter switching and quantization with caching.
  • Session state management (save/load/reset); together with LoRA switching, this is sketched after this list.
  • Custom logger implementation.
  • WebSocket server and Web UI planned.
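
The sketch below illustrates session persistence and runtime LoRA switching. The class and method names (Logger, save_state, load_state, attach_lora, detach_lora) and the logger constructor argument follow the project's examples but are assumptions here; confirm them against examples/python/.

    # Hedged sketch; names may differ between releases.
    from fastllama import Model, Logger

    class PrintLogger(Logger):
        # Hypothetical subclass routing backend errors to stdout.
        def log_err(self, func_name: str, message: str) -> None:
            print(f"[error] {func_name}: {message}", flush=True)

    model = Model(
        path="./models/7B/ggml-model-q4_0.bin",  # hypothetical model path
        num_threads=8,
        n_ctx=512,
        logger=PrintLogger(),  # assumed argument for plugging in a custom logger
    )

    # Persist the current context (e.g., an ingested system prompt) so a later
    # run can resume without re-ingesting it.
    model.save_state("./states/session.bin")
    model.load_state("./states/session.bin")

    # Swap adapters at runtime; caching is intended to make repeated switches cheap.
    model.attach_lora("./loras/alpaca-lora.bin")  # hypothetical adapter path
    model.detach_lora()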

Maintenance & Community

  • The roadmap includes GPU Int4 support and multi-language bindings, though the last commit was two years ago (see Health Check below).
  • Contributions are welcomed via PRs.

Licensing & Compatibility

  • The project appears to use the MIT license, as does the underlying llama.cpp; both are permissive, so commercial use is generally unproblematic. Verify the repository's license file before deployment.

Limitations & Caveats

  • Currently experimental; NVIDIA GPU Int4 support and Windows and Android builds are still pending.
  • Models are loaded fully into RAM, so memory requirements scale with model size (e.g., a 4-bit-quantized 7B model needs ~3.9 GB: roughly 7 billion parameters at 0.5 bytes each, plus context buffers).

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

1 star in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference

  • Top 0.4% on sourcepulse
  • 84k stars
  • created 2 years ago, updated 18 hours ago