lucebox-hub by Luce-Org

Optimized LLM inference for specific hardware

Created 2 weeks ago

273 stars

Top 94.5% on SourcePulse

Project Summary

Lucebox-hub tackles inefficient LLM inference on consumer hardware by providing hand-tuned, chip-specific software optimizations. It targets engineers and power users aiming for high-performance, private, and cost-effective local AI deployment, maximizing existing silicon capabilities.

How It Works

"Megakernel" fuses model layers into a single CUDA dispatch, eliminating CPU round-trips between kernel launches and using cooperative grid synchronization for speed. "DFlash" implements speculative decoding, featuring the first GGUF port of the technique, with custom CUDA kernels for efficient state rollback when draft tokens are rejected, enabling high throughput and long context windows on consumer GPUs.
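To make the speculative-decoding idea concrete, here is a toy Python sketch of propose/verify/rollback. This is an illustrative model of the technique, not the DFlash CUDA implementation: `draft_propose` and `target_next` are made-up stand-ins for a cheap draft model and the expensive target model.

```python
# Toy speculative decoding with rollback (hypothetical stand-in models).

def draft_propose(prefix, k):
    """Hypothetical cheap draft model: guesses the next k tokens."""
    return [(prefix[-1] + i + 1) % 7 for i in range(k)]

def target_next(prefix):
    """Hypothetical expensive target model: the authoritative next token."""
    return (prefix[-1] + 1) % 5

def speculative_step(prefix, k=4):
    """Propose k draft tokens, verify against the target, roll back on mismatch."""
    accepted = []
    for tok in draft_propose(prefix, k):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)      # draft token verified, keep it
        else:
            break                     # mismatch: discard (roll back) the rest
    # After verification (or rollback) the target contributes one token,
    # so every step makes progress even when all drafts are rejected.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted

seq = [3]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # → [3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0]
```

When draft and target agree, several tokens are committed per expensive verification pass instead of one, which is where the speedup comes from; the real kernels do the rollback on GPU state rather than Python lists.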

Quick Start & Requirements

Clone the repository with submodules (git clone --recurse-submodules). Setup is project-specific: pip install -e . for Megakernel, and a CMake build plus model downloads for DFlash. Prerequisites are an NVIDIA GPU (Ampere or newer), CUDA 12+, and PyTorch 2.0+; testing was performed on an RTX 3090. The pinned Luce-Org/llama.cpp@luce-dflash fork is required. Detailed writeups, benchmarks, and blog posts are linked within each project's section.
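The steps above might look like the following; the repository URL and the `megakernel/` and `dflash/` subdirectory names are illustrative assumptions, not documented paths:

```shell
# Clone with submodules (pulls the pinned Luce-Org/llama.cpp@luce-dflash fork).
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub.git  # hypothetical URL
cd lucebox-hub

# Megakernel: editable install (requires CUDA 12+ and PyTorch 2.0+).
pip install -e ./megakernel   # hypothetical subdirectory name

# DFlash: CMake build against the pinned llama.cpp fork.
cmake -S dflash -B dflash/build   # hypothetical layout
cmake --build dflash/build
```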

Highlighted Details

  • Megakernel achieves 1.87 tok/J on a 2020 GPU, matching Apple silicon throughput at 2x.
  • DFlash delivers 130 tok/s on an RTX 3090 with a 128K context window, achieving 3.5x speedup over chain speculative decoding.
  • Focuses on rewriting software for specific chips rather than relying on hardware advancements.

Maintenance & Community

Community engagement is facilitated via Discord (discord.gg/yHfswqZmJQ) and a public issue tracker. A roadmap details future optimizations for Ryzen AI and heterogeneous systems. Further project information is available on the website (lucebox.com) and blog (lucebox.com/blog).

Licensing & Compatibility

Released under the MIT license, permitting broad use, modification, and distribution, including for commercial applications and integration into closed-source projects.

Limitations & Caveats

Optimizations are highly specific to particular hardware (NVIDIA Ampere+, RTX 3090) and models (Qwen 3.5). DFlash requires a custom llama.cpp fork and specific quantization formats (Q4_K_M GGUF) to manage memory constraints. Tuning may not transfer directly to different hardware architectures without significant rework.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
4
Issues (30d)
1
Star History
276 stars in the last 16 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
13k
Efficient CUDA kernels for MLA decoding
Created 1 year ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.5%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 19 hours ago