dotLLM  by kkokosa

Natively built .NET LLM inference engine

Created 2 months ago
402 stars

Top 71.8% on SourcePulse

Project Summary

Summary

dotLLM provides a high-performance LLM inference engine natively implemented in C#/.NET, targeting developers seeking efficient LLM integration. It bypasses Python/C++ wrappers, offering significant speedups via SIMD-optimized CPU and CUDA GPU backends, making advanced LLM capabilities accessible within the .NET ecosystem.

How It Works

The engine is built from the ground up in pure C#, leveraging System.Runtime.Intrinsics for SIMD CPU operations and PTX kernels launched through the CUDA Driver API for GPU acceleration. It supports transformer models such as Llama, Mistral, and Phi. Key design choices include zero-GC inference using unmanaged memory, memory-mapped GGUF loading for millisecond model startup, and a modular NuGet package structure. Because everything runs in-process, this native approach avoids the overhead of inter-process communication and foreign function interfaces.
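As a rough illustration of the SIMD side (a sketch only, not dotLLM's actual code, which uses the lower-level System.Runtime.Intrinsics API), the portable System.Numerics.Vector&lt;T&gt; type shows the same idea: one loop iteration processes several floats per instruction, depending on the CPU's register width.

```csharp
using System;
using System.Numerics;

// Illustrative sketch of SIMD vectorization in .NET (not dotLLM's code):
// Vector<float> maps to the widest SIMD registers the CPU supports
// (SSE/AVX on x86, NEON on ARM), so each iteration handles 4-16 floats.
static float DotSimd(float[] a, float[] b)
{
    var acc = Vector<float>.Zero;
    int i = 0;
    for (; i <= a.Length - Vector<float>.Count; i += Vector<float>.Count)
        acc += new Vector<float>(a, i) * new Vector<float>(b, i);

    float sum = Vector.Sum(acc);
    for (; i < a.Length; i++)   // scalar tail for the leftover elements
        sum += a[i] * b[i];
    return sum;
}

var x = new float[1000];
var y = new float[1000];
Array.Fill(x, 1.5f);
Array.Fill(y, 2.0f);
Console.WriteLine(DotSimd(x, y)); // 1000 * 1.5 * 2.0 = 3000
```

A loop like this allocates nothing per call even on the managed heap; per the design notes above, dotLLM goes further by keeping the tensors themselves in unmanaged memory so the GC never scans them.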

Quick Start & Requirements

Installation options:

  • Global .NET tool: `dotnet tool install -g DotLLM.Cli --prerelease`
  • Self-contained binaries from releases.
  • NuGet packages for .NET applications.

Requires the .NET 10 runtime. GPU acceleration needs an NVIDIA GPU and the CUDA Toolkit. Python 3.10+ is used for scripts.

Website: https://dotllm.dev/
Documentation: docs/
Roadmap: docs/ROADMAP.md
Discussions: https://github.com/kkokosa/dotLLM/discussions

Highlighted Details

  • Zero-GC Inference: Uses unmanaged memory (NativeMemory.AlignedAlloc) for tensors, avoiding managed heap allocations.
  • Performance: Features SIMD vectorization, fused operators, and efficient attention mechanisms.
  • Rapid Loading: Memory-mapped GGUF loading enables multi-GB models to load in milliseconds.
  • Model Support: Handles Llama, Mistral, Phi, Qwen, DeepSeek with various quantizations (FP16, Q8_0, Q4_K_M).
  • OpenAI-Compatible API: Includes an ASP.NET Core server with a chat UI.
  • Advanced Features: Supports paged KV-cache, speculative decoding, and constrained decoding (JSON, schema, regex, grammar).
  • Hybrid Compute: Offers CPU/GPU layer offloading and NUMA/P-core aware threading.
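The millisecond-startup figure above rests on memory mapping: the OS maps the GGUF file into the process address space without copying its bytes, and pages are faulted in lazily on first access. A minimal sketch of that mechanism (illustrative only, not dotLLM's loader; the 16 MB temp file stands in for a multi-GB GGUF):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Stand-in for a multi-GB GGUF model file (sketch only, not dotLLM's loader).
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[16 * 1024 * 1024]);

byte first;
using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    // Creating the mapping completes without reading the file's contents;
    // this first read is what actually faults a page in from disk.
    first = view.ReadByte(0);
}
Console.WriteLine(first);
File.Delete(path);
```

Because only the pages a model actually touches are ever read, "loading" is near-instant regardless of file size, at the cost of first-token latency while the working set pages in.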

Maintenance & Community

Authored by .NET MVP Konrad Kokosa. Community engagement happens via GitHub Discussions, and a detailed roadmap is available in docs/ROADMAP.md.

Licensing & Compatibility

Licensed under the GNU General Public License v3.0 (GPL v3). This strong copyleft license requires derivative works to be distributed under the same terms, which may complicate commercial use or integration into proprietary software.

Limitations & Caveats

Native AOT builds are experimental, and speculative decoding currently supports greedy sampling only. The GPL v3 license carries copyleft obligations. Continuous batching and advanced scheduling are planned for future releases.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 29
  • Issues (30d): 35
  • Star history: 404 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

JittorLLMs by Jittor

0.0%
2k
Low-resource LLM inference library
Created 3 years ago
Updated 1 year ago
Starred by Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.6%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Anil Dash (former CEO of Glitch), and 23 more.

llamafile by mozilla-ai

0.3%
24k
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago
Updated 4 days ago