Luce-Org: Optimized LLM inference for specific hardware
Summary
Lucebox-hub tackles inefficient LLM inference on consumer hardware by providing hand-tuned, chip-specific software optimizations. It targets engineers and power users who want high-performance, private, and cost-effective local AI deployment, squeezing the most out of the silicon they already own.
How It Works
Projects like "Megakernel" fuse model layers into single CUDA dispatches, eliminating CPU round-trips and using cooperative grid sync for speed. "DFlash" implements speculative decoding, featuring the first GGUF port with custom CUDA kernels for efficient state rollback, enabling high throughput and long context on consumer GPUs.
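The verify-and-rollback loop behind speculative decoding can be sketched in plain Python. This is a toy greedy version, not DFlash's CUDA implementation; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models:

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=12):
    """Toy greedy speculative decoding.

    The cheap draft model proposes k tokens ahead; the target model
    verifies them and, on the first mismatch, state rolls back to the
    last accepted token and the target's own token is emitted instead.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest matching prefix.
        accepted, ctx = [], list(seq)
        for t in proposal:
            expect = target_next(ctx)
            if expect != t:
                # Mismatch: discard the rest of the draft (rollback)
                # and keep the target's token so output stays exact.
                accepted.append(expect)
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq[len(prompt):]
```

The key property is that the output is identical to decoding with the target model alone; the draft model only changes how many target evaluations are amortized per step.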
Quick Start & Requirements
Clone the repository with submodules (git clone --recurse-submodules). Project-specific setup involves pip install -e . for Megakernel, or a CMake build plus model downloads for DFlash. Prerequisites include an NVIDIA GPU (Ampere or newer), CUDA 12+, and PyTorch 2.0+; testing was performed on an RTX 3090. A pinned Luce-Org/llama.cpp@luce-dflash fork is required. Detailed writeups, benchmarks, and blog posts are linked from each project's section.
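The setup flow might look like the following sketch; the repository URL, directory names, and CMake source path are placeholders, not confirmed paths:

```shell
# Clone with submodules so the pinned llama.cpp fork is pulled in.
git clone --recurse-submodules <repo-url> lucebox-hub
cd lucebox-hub

# Megakernel: editable install of the Python package.
pip install -e .

# DFlash: CMake build (exact source directory may differ).
cmake -B build -S dflash
cmake --build build
```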
Maintenance & Community
Community engagement is facilitated via Discord (discord.gg/yHfswqZmJQ) and a public issue tracker. A roadmap details future optimizations for Ryzen AI and heterogeneous systems. Further project information is available on the website (lucebox.com) and blog (lucebox.com/blog).
Licensing & Compatibility
Released under the MIT license, permitting broad use, modification, and distribution, including for commercial applications and integration into closed-source projects.
Limitations & Caveats
Optimizations are highly specific to particular hardware (NVIDIA Ampere+, RTX 3090) and models (Qwen 3.5). DFlash requires a custom llama.cpp fork and specific quantization formats (Q4_K_M GGUF) to manage memory constraints. Tuning may not transfer directly to different hardware architectures without significant rework.