ntransformer by xaskasdf

LLM inference engine enabling large models on consumer GPUs

Created 1 week ago

379 stars

Top 75.4% on SourcePulse

1 Expert Loves This Project
Project Summary

This project provides a high-efficiency LLM inference engine written in C++/CUDA, designed to run large language models on consumer-grade hardware with limited VRAM. It targets engineers and power users seeking to deploy powerful models like Llama 70B on single GPUs such as the RTX 3090, significantly lowering hardware barriers through advanced memory management and I/O techniques.

How It Works

NTransformer employs a novel 3-tier adaptive caching system (VRAM, pinned RAM, NVMe/mmap) coupled with SLEP (Streaming Layer Engine Pipeline) and an optional gpu-nvme-direct backend. This architecture streams model layers through GPU memory via PCIe, with the gpu-nvme-direct backend enabling direct NVMe I/O that bypasses the CPU entirely. This approach optimizes data movement and leverages tiered storage for substantial speedups over traditional methods. Features like layer skipping, which selectively omits redundant layers based on cosine similarity, further enhance inference performance.

Quick Start & Requirements

  • Primary install / run command: Build using CMake (cmake .. -DCMAKE_BUILD_TYPE=Release ...; cmake --build . -j). Run via the compiled binary ./ntransformer. A comprehensive system setup script (scripts/setup_system.sh) is provided for complex configurations.
  • Non-default prerequisites and dependencies: Linux (kernel 6.17+ tested on Ubuntu), CUDA Toolkit 13.1, GCC 14, CMake 3.24+, NVIDIA GPU with Compute Capability 8.0+ (RTX 3090 tested). An NVMe SSD on a separate PCIe slot is required for the gpu-nvme-direct backend.
  • Estimated setup time or resource footprint: System setup involves multiple phases and requires careful execution due to low-level system modifications.
  • Links: No specific quick-start, docs, or demo links are provided beyond the README's embedded commands and scripts.
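The build commands above, laid out as a sequence. This is a sketch of a standard out-of-source CMake build around the commands quoted in the README; the `mkdir build` step is my assumption, and the README elides the project's additional CMake flags ("...").

```shell
# Standard out-of-source CMake build (directory layout assumed).
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release   # plus the project's additional flags, elided in the README
cmake --build . -j                    # parallel build
./ntransformer                        # run the compiled binary
```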

Highlighted Details

  • Enables running Llama 3.1 70B models on a single RTX 3090 (24GB VRAM) with 48GB RAM.
  • Achieves up to an 83x speedup over an mmap baseline for 70B models using tiered caching and NVMe direct I/O.
  • Supports GGUF model formats with various quantizations including Q4_0, Q8_0, Q4_K_M, Q5_K, Q6_K, F16, and F32.
  • Features include layer skipping (up to 20/80 layers skipped) and self-speculative decoding.
  • Zero external dependencies beyond the CUDA Toolkit (no PyTorch, no cuBLAS).

Maintenance & Community

No explicit information regarding contributors, sponsorships, community channels (Discord/Slack), or a public roadmap is present in the provided README.

Licensing & Compatibility

The project is licensed under the BSD-2-Clause license. This permissive license is generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The gpu-nvme-direct backend requires significant, potentially risky system-level modifications applied by an automated setup script, including disabling the IOMMU, patching the NVIDIA DKMS module, and binding NVMe devices to VFIO; misconfiguration can cause system instability, boot failures, or data loss. Users are strongly warned not to use their boot drive for NVMe direct I/O and to proceed at their own risk. The project has been tested only on specific hardware configurations (RTX 3090, WD SN740 NVMe).

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 1
  • Star History: 381 stars in the last 7 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16
Top 0.1% on SourcePulse · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 17 hours ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
Top 9.5% on SourcePulse · 13k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 5 months ago