Discover and explore top open-source AI tools and projects—updated daily.
NTransformer: LLM inference engine enabling large models on consumer GPUs
Top 75.4% on SourcePulse
This project provides a high-efficiency LLM inference engine written in C++/CUDA, designed to run large language models on consumer-grade hardware with limited VRAM. It targets engineers and power users seeking to deploy powerful models like Llama 70B on single GPUs such as the RTX 3090, significantly lowering hardware barriers through advanced memory management and I/O techniques.
How It Works
NTransformer employs a novel 3-tier adaptive caching system (VRAM, pinned RAM, NVMe/mmap) coupled with SLEP (Streaming Layer Engine Pipeline) and an optional gpu-nvme-direct backend. This architecture streams model layers through GPU memory via PCIe, with the gpu-nvme-direct backend enabling direct NVMe I/O that bypasses the CPU entirely. This approach optimizes data movement and leverages tiered storage for substantial speedups over traditional methods. Features like layer skipping, which selectively omits redundant layers based on cosine similarity, further enhance inference performance.
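The layer-skipping idea described above can be sketched as follows. This is an illustrative Python sketch, not NTransformer's actual C++/CUDA implementation; the function names, the single-pass structure, and the similarity threshold are assumptions made for demonstration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened hidden-state tensors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def run_with_layer_skipping(hidden, layers, threshold=0.999):
    """Run transformer layers, skipping any whose output is nearly
    identical to its input.

    A layer whose input/output cosine similarity exceeds `threshold`
    is treated as redundant: its output is discarded and the input is
    passed through unchanged.
    """
    skipped = []
    for i, layer in enumerate(layers):
        out = layer(hidden)
        if cosine_similarity(hidden, out) > threshold:
            skipped.append(i)   # layer barely changed the activations
            continue            # keep the input; drop the layer's output
        hidden = out
    return hidden, skipped
```

In a real engine the skip decisions would likely be cached from a calibration pass, so that redundant layers (and their weights) are never loaded or executed at inference time, saving both compute and I/O bandwidth.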
Quick Start & Requirements
Build with CMake from a build directory (cmake .. -DCMAKE_BUILD_TYPE=Release ...; cmake --build . -j), then run the compiled binary ./ntransformer. A comprehensive system setup script (scripts/setup_system.sh) is provided for complex configurations such as the gpu-nvme-direct backend.
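The build steps above, laid out as a script. This is a reconstruction from the fragment: the additional configure flags were elided ("...") in the source and are not reproduced here, and the out-of-source directory layout is an assumption.

```shell
# Assumed out-of-source CMake build; extra configure flags elided in the source.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
# Run the resulting binary
./ntransformer
```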
Maintenance & Community
No explicit information regarding contributors, sponsorships, community channels (Discord/Slack), or a public roadmap is present in the provided README.
Licensing & Compatibility
The project is licensed under the BSD-2-Clause license. This permissive license is generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The gpu-nvme-direct backend necessitates significant, potentially risky, system-level modifications via an automated setup script. These include disabling IOMMU, patching NVIDIA DKMS, and binding NVMe devices to VFIO, which carry risks of system instability, boot failures, or data loss if misconfigured. Users are strongly warned against using their boot drive for NVMe direct I/O and to proceed at their own risk. The project is tested on specific hardware configurations (RTX 3090, WD SN740 NVMe).