Bonsai-demo by PrismML-Eng

Run large language models locally with optimized backends

Created 2 weeks ago


456 stars

Top 66.1% on SourcePulse

Project Summary

This project provides a streamlined demo and setup process for running Bonsai language models locally across diverse hardware. It targets engineers and researchers seeking an accessible way to deploy LLMs on Mac (Metal, Apple Silicon), Linux, and Windows (CUDA/CPU), offering a unified interface for model inference.

How It Works

The project integrates two inference backends: llama.cpp for broad cross-platform compatibility (GGUF format) and MLX for optimized performance on Apple Silicon (MLX format). Crucially, it uses custom forks of both projects (PrismML-Eng/llama.cpp, PrismML-Eng/mlx) that add inference kernels not yet available upstream, enabling immediate functionality.
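The backend-to-hardware mapping above can be sketched as a small dispatch script. This is an assumption about how one might pick a backend, not the demo's actual logic, which the README does not show:

```shell
#!/bin/sh
# Pick an inference backend by platform (sketch).
# MLX requires Apple Silicon; llama.cpp (GGUF) covers everything else.
case "$(uname -s)-$(uname -m)" in
  Darwin-arm64) BACKEND=mlx ;;        # Apple Silicon Mac: Metal-optimized MLX
  *)            BACKEND=llama.cpp ;;  # Linux, Windows, Intel Mac: GGUF via llama.cpp
esac
echo "selected backend: $BACKEND"
```

Keeping both backends behind one interface is what lets the same demo scripts run on every supported platform.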

Quick Start & Requirements

  • Primary install / run command: ./setup.sh (macOS/Linux) or .\setup.ps1 (Windows).
  • Non-default prerequisites and dependencies:
    • macOS: Xcode CLT.
    • Linux: build-essential.
    • Windows: Visual Studio Build Tools (for building from source).
    • Python: Managed via uv and a virtual environment.
    • CUDA toolkit: Required for Linux/Windows GPU acceleration.
    • PRISM_HF_TOKEN: Required for downloading models from private HuggingFace repositories.
  • Estimated setup time or resource footprint: the setup.sh/setup.ps1 script automates dependency installation, environment setup, model downloads, and binary acquisition/compilation, so expect a lengthy first run with substantial disk and network usage.
  • Links: Bonsai Demo Website, HuggingFace Collection, Whitepaper, GitHub, Discord (no direct URLs provided in README).
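Putting the requirements above together, a first run might look like the following. The `PRISM_HF_TOKEN` variable name and the script names come from the README; any `setup.sh` flags are undocumented, so none are assumed here:

```shell
# One-time setup on macOS/Linux (use .\setup.ps1 on Windows).
# The token must be exported first so downloads from private
# HuggingFace repositories succeed; the script then installs
# dependencies, creates the uv-managed virtual environment,
# fetches models, and builds or downloads backend binaries.
export PRISM_HF_TOKEN="<your HuggingFace token>"
./setup.sh
```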

Highlighted Details

  • Offers Bonsai models in three sizes: 8B, 4B, and 1.7B.
  • Supports both GGUF (llama.cpp) and MLX (Apple Silicon) model formats.
  • The 8B model supports context lengths up to 65,536 tokens, with dynamic KV cache sizing.
  • Includes scripts for direct inference, running a local chat server, and integration with Open WebUI.
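The 65,536-token context figure implies a sizable KV cache at full length. A back-of-envelope sketch, using hypothetical 8B-class dimensions (32 layers, 8 KV heads of dimension 128, fp16 cache) since the README does not state the model architecture:

```shell
#!/bin/sh
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim
#                 x context_length x bytes_per_element.
# All model dimensions below are illustrative assumptions.
N_LAYERS=32 N_KV_HEADS=8 HEAD_DIM=128 CTX_LEN=65536 BYTES=2
KV_BYTES=$((2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX_LEN * BYTES))
echo "KV cache at full context: $((KV_BYTES / 1024 / 1024 / 1024)) GiB"
```

Under these assumptions the cache alone reaches 8 GiB at full context, which is presumably why the demo sizes the KV cache dynamically rather than allocating the maximum up front.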

Maintenance & Community

The project maintains custom forks of llama.cpp and MLX, suggesting active development on these core components. Community support is available via Discord.

Licensing & Compatibility

The specific open-source license is not explicitly stated in the provided README, which is a critical omission for due diligence. The project is designed for local execution on macOS (Apple Silicon, Metal), Linux (CUDA, CPU), and Windows (CUDA, CPU).

Limitations & Caveats

The project depends on custom forks of llama.cpp and MLX because the required kernels are missing upstream. Model downloads require a PRISM_HF_TOKEN, indicating reliance on private HuggingFace repositories. The all-in-one setup script may need manual intervention on less common system configurations.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 11
  • Issues (30d): 21
  • Star history: 459 stars in the last 17 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.

torchchat by pytorch (4k stars, top 0.1%)

PyTorch-native SDK for local LLM inference across diverse platforms. Created 2 years ago; updated 7 months ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Anil Dash (former CEO of Glitch), and 23 more.

llamafile by mozilla-ai (24k stars, top 0.5%)

Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc. Created 2 years ago; updated 1 day ago.