Python toolkit for on-premises LLMs applied to private data
OnPrem.LLM is a Python toolkit designed to simplify the integration of on-premises Large Language Models (LLMs) with private data. It targets developers and researchers who need to apply LLMs to sensitive or locally stored information, offering a unified interface for document intelligence tasks such as retrieval-augmented generation (RAG), summarization, and few-shot classification.
How It Works
The toolkit primarily leverages llama-cpp-python for efficient local LLM inference, supporting GGUF model formats and GPU offloading via CUDA or Metal. It also offers an alternative backend using Hugging Face Transformers, enabling broader model compatibility and easier integration with quantized models (e.g., AWQ, bitsandbytes). For data processing, it supports various PDF extraction methods, including OCR and table structure inference, and offers both dense (Chroma) and sparse vector stores for efficient document retrieval.
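A minimal sketch of that workflow, assuming the toolkit's LLM class with its ingest and ask helpers (the folder path, question, and n_gpu_layers value are illustrative; consult the project documentation for exact signatures and defaults):

```python
from onprem import LLM

# Load a local GGUF model via llama-cpp-python; n_gpu_layers controls how many
# layers are offloaded to the GPU (the value here is an illustrative choice).
llm = LLM(n_gpu_layers=35)

# Ingest a folder of private documents into the vector store, then answer a
# question with retrieval-augmented generation over those documents.
llm.ingest("./private_docs")
result = llm.ask("What do these documents say about data retention?")
print(result["answer"])
```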
Quick Start & Requirements
pip install onprem
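Once the package and a backend are installed, a first prompt might look like the following (the default model download behavior is an assumption; see the project docs):

```python
from onprem import LLM

llm = LLM()  # loads a default local model on first use (assumed behavior)
print(llm.prompt("List three benefits of running LLMs on-premises."))
```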
Requirements: llama-cpp-python (for CPU/GPU inference), the CUDA Toolkit (for NVIDIA GPUs), or Hugging Face Transformers. For GPU acceleration, llama-cpp-python must be compiled with GGML_CUDA=on (Linux) or GGML_METAL=on (Mac).
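For GPU builds, these flags are typically passed to pip through CMAKE_ARGS when compiling llama-cpp-python (commands shown for illustration; check the llama-cpp-python documentation for your platform):

```bash
# NVIDIA GPU (Linux): build llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Apple Silicon (Mac): build with Metal support
CMAKE_ARGS="-DGGML_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```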
Highlighted Details
Supports multiple backends for local inference (llama-cpp-python, Hugging Face Transformers) and cloud LLMs (via LiteLLM).
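As a rough illustration of switching between a local backend and a cloud model, something like the following might apply; the LiteLLM-style model string and the parameter it is passed to are assumptions, so verify against the project documentation:

```python
from onprem import LLM

# Local inference through llama-cpp-python (the default backend).
local_llm = LLM()

# Hypothetical: route requests to a hosted model via LiteLLM by passing a
# provider-prefixed model string (parameter name and format are assumptions).
cloud_llm = LLM(model_url="openai/gpt-4o-mini")

print(cloud_llm.prompt("Summarize the trade-offs of on-prem vs. cloud LLMs."))
```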
Maintenance & Community
The project is under active development, with frequent releases (v0.13.0 as of April 2025) introducing features such as streamlined Ollama/cloud LLM support and an improved Web UI.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Installation of llama-cpp-python can be complex, especially on Windows, where WSL is recommended. AWQ quantization support is limited to Linux systems. The project's license is not clearly stated, which may impact commercial adoption.