ntransformer by xaskasdf

LLM inference engine enabling large models on consumer GPUs

Created 1 week ago

379 stars

Top 75.4% on SourcePulse

1 Expert Loves This Project
Project Summary

This project provides a high-efficiency LLM inference engine written in C++/CUDA, designed to run large language models on consumer-grade hardware with limited VRAM. It targets engineers and power users seeking to deploy powerful models like Llama 70B on single GPUs such as the RTX 3090, significantly lowering hardware barriers through advanced memory management and I/O techniques.

How It Works

NTransformer employs a novel 3-tier adaptive caching system (VRAM, pinned RAM, NVMe/mmap) coupled with SLEP (Streaming Layer Engine Pipeline) and an optional gpu-nvme-direct backend. This architecture streams model layers through GPU memory via PCIe, with the gpu-nvme-direct backend enabling direct NVMe I/O that bypasses the CPU entirely. This approach optimizes data movement and leverages tiered storage for substantial speedups over traditional methods. Features like layer skipping, which selectively omits redundant layers based on cosine similarity, further enhance inference performance.

Quick Start & Requirements

  • Primary install / run command: Build using CMake (cmake .. -DCMAKE_BUILD_TYPE=Release ...; cmake --build . -j). Run via the compiled binary ./ntransformer. A comprehensive system setup script (scripts/setup_system.sh) is provided for complex configurations.
  • Non-default prerequisites and dependencies: Linux (kernel 6.17+ tested on Ubuntu), CUDA Toolkit 13.1, GCC 14, CMake 3.24+, NVIDIA GPU with Compute Capability 8.0+ (RTX 3090 tested). An NVMe SSD on a separate PCIe slot is required for the gpu-nvme-direct backend.
  • Estimated setup time or resource footprint: System setup involves multiple phases and requires careful execution due to low-level system modifications.
  • Links: No specific quick-start, docs, or demo links are provided beyond the README's embedded commands and scripts.
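The build commands above, laid out as a sequence. This is a sketch of a standard out-of-source CMake build around the commands quoted in the README; the `mkdir build` step is my assumption, and the README elides the project's additional CMake flags ("...").

```shell
# Standard out-of-source CMake build (directory layout assumed).
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release   # plus the project's additional flags, elided in the README
cmake --build . -j                    # parallel build
./ntransformer                        # run the compiled binary
```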

Highlighted Details

  • Enables running Llama 3.1 70B models on a single RTX 3090 (24GB VRAM) with 48GB RAM.
  • Achieves up to an 83x speedup over an mmap baseline for 70B models using tiered caching and NVMe direct I/O.
  • Supports GGUF model formats with various quantizations including Q4_0, Q8_0, Q4_K_M, Q5_K, Q6_K, F16, and F32.
  • Features include layer skipping (up to 20/80 layers skipped) and self-speculative decoding.
  • Zero external dependencies beyond the CUDA Toolkit (no PyTorch, no cuBLAS).

Maintenance & Community

No explicit information regarding contributors, sponsorships, community channels (Discord/Slack), or a public roadmap is present in the provided README.

Licensing & Compatibility

The project is licensed under the BSD-2-Clause license. This permissive license is generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The gpu-nvme-direct backend requires significant, potentially risky system-level modifications applied by an automated setup script, including disabling the IOMMU, patching the NVIDIA DKMS module, and binding NVMe devices to VFIO; misconfiguration can cause system instability, boot failures, or data loss. Users are strongly warned not to use their boot drive for NVMe direct I/O and to proceed at their own risk. The project has been tested only on specific hardware configurations (RTX 3090, WD SN740 NVMe).

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 1
  • Star History: 381 stars in the last 7 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16
Top 0.1% on SourcePulse · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 17 hours ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
Top 9.5% on SourcePulse · 13k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 5 months ago