lucebox-hub by Luce-Org

Optimized LLM inference for specific hardware

Created 2 weeks ago

273 stars

Top 94.5% on SourcePulse

Project Summary

Lucebox-hub tackles inefficient LLM inference on consumer hardware by providing hand-tuned, chip-specific software optimizations. It targets engineers and power users aiming for high-performance, private, and cost-effective local AI deployment, maximizing existing silicon capabilities.

How It Works

"Megakernel" fuses model layers into a single CUDA dispatch, eliminating CPU round-trips between kernel launches and using cooperative grid synchronization for speed. "DFlash" implements speculative decoding, featuring the first GGUF port of the technique, with custom CUDA kernels for efficient state rollback when draft tokens are rejected, enabling high throughput and long context windows on consumer GPUs.
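To make the speculative-decoding idea concrete, here is a toy Python sketch of propose/verify/rollback. This is an illustrative model of the technique, not the DFlash CUDA implementation: `draft_propose` and `target_next` are made-up stand-ins for a cheap draft model and the expensive target model.

```python
# Toy speculative decoding with rollback (hypothetical stand-in models).

def draft_propose(prefix, k):
    """Hypothetical cheap draft model: guesses the next k tokens."""
    return [(prefix[-1] + i + 1) % 7 for i in range(k)]

def target_next(prefix):
    """Hypothetical expensive target model: the authoritative next token."""
    return (prefix[-1] + 1) % 5

def speculative_step(prefix, k=4):
    """Propose k draft tokens, verify against the target, roll back on mismatch."""
    accepted = []
    for tok in draft_propose(prefix, k):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)      # draft token verified, keep it
        else:
            break                     # mismatch: discard (roll back) the rest
    # After verification (or rollback) the target contributes one token,
    # so every step makes progress even when all drafts are rejected.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted

seq = [3]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # → [3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0]
```

When draft and target agree, several tokens are committed per expensive verification pass instead of one, which is where the speedup comes from; the real kernels do the rollback on GPU state rather than Python lists.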

Quick Start & Requirements

Clone the repository with submodules (git clone --recurse-submodules). Setup is project-specific: pip install -e . for Megakernel, and a CMake build plus model downloads for DFlash. Prerequisites are an NVIDIA GPU (Ampere or newer), CUDA 12+, and PyTorch 2.0+; testing was performed on an RTX 3090. The pinned Luce-Org/llama.cpp@luce-dflash fork is required. Detailed writeups, benchmarks, and blog posts are linked within each project's section.
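The steps above might look like the following; the repository URL and the `megakernel/` and `dflash/` subdirectory names are illustrative assumptions, not documented paths:

```shell
# Clone with submodules (pulls the pinned Luce-Org/llama.cpp@luce-dflash fork).
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub.git  # hypothetical URL
cd lucebox-hub

# Megakernel: editable install (requires CUDA 12+ and PyTorch 2.0+).
pip install -e ./megakernel   # hypothetical subdirectory name

# DFlash: CMake build against the pinned llama.cpp fork.
cmake -S dflash -B dflash/build   # hypothetical layout
cmake --build dflash/build
```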

Highlighted Details

  • Megakernel achieves 1.87 tok/J on a 2020 GPU, matching Apple silicon throughput at 2x.
  • DFlash delivers 130 tok/s on an RTX 3090 with a 128K context window, achieving 3.5x speedup over chain speculative decoding.
  • Focuses on rewriting software for specific chips rather than relying on hardware advancements.

Maintenance & Community

Community engagement is facilitated via Discord (discord.gg/yHfswqZmJQ) and a public issue tracker. A roadmap details future optimizations for Ryzen AI and heterogeneous systems. Further project information is available on the website (lucebox.com) and blog (lucebox.com/blog).

Licensing & Compatibility

Released under the MIT license, permitting broad use, modification, and distribution, including for commercial applications and integration into closed-source projects.

Limitations & Caveats

Optimizations are highly specific to particular hardware (NVIDIA Ampere+, RTX 3090) and models (Qwen 3.5). DFlash requires a custom llama.cpp fork and specific quantization formats (Q4_K_M GGUF) to manage memory constraints. Tuning may not transfer directly to different hardware architectures without significant rework.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
4
Issues (30d)
1
Star History
276 stars in the last 16 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
13k
Efficient CUDA kernels for MLA decoding
Created 1 year ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.5%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 19 hours ago